1. Overview and Summary

1.1  Scope  of  this  Report 

This  document  reports  the  research  activities  and  results  for  the  period 
16  April  1983  to  15  October  1983  under  the  Defense  Advanced  Research 
Projects Agency (DARPA) Submicron Systems Architecture Project.

1.2 Objectives

The  central  theme  of  this  research  is  the  architecture  and  design  of  VLSI 
systems  appropriate  to  a  microcircuit  technology  scaled  to  submicron 
feature  sizes,  and  includes  related  efforts  in  concurrent  computation  and 
VLSI  design.  Additional  background  information  can  be  found  in  previous 
semiannual  technical  reports  [5052:TR:82,  5078:TR:83]. 

1.3  Highlights 

The  highlights  of  the  previous  6  months  are: 

(1)  The  cosmic  cube,  a  64-element  experimental  homogeneous  machine,  was 
completed  and  is  now  in  regular  use.  Benchmarks  of  this  system  show  it 
outrunning a VAX-11/780 by a factor of 6 on two large regular computations.
Numerous  other  application  programs  are  in  progress  for  this  machine.  A 
new  operating  system,  the  cosmic  kernel,  has  been  defined  and  its  code  is 
being  written. 

(2)  Prototype  mosaic  processors  have  been  packaged  with  fast  off-chip 
storage  for  a  small  mosaic-tree  for  software  experiments.  Meanwhile,  the 
efforts  in  the  design  of  the  single  chip  mosaic  element  are  approaching 
completion,  with  the  storage  section  designed  and  several  processor 
improvements accomplished, including interrupts, a multiply instruction, a
faster control PLA, and new microcode.

(3)  The  algorithms  for  and  logical  design  of  the  super  mesh  element  are 
complete  and  have  been  simulated  with  MOSSIM. 

(4)  FMOSSIM,  a  concurrent  fault  simulator  for  MOS  digital  systems,  is  now 
operational.  This  program  uses  the  same  switch-level  representation  of 
MOS  circuits  as  the  logic  simulator  MOSSIM  II,  and  so  can  model  such  MOS 
circuit  structures  as  (bidirectional)  pass  transistors,  static  and 
precharged  logic,  busses,  and  both  static  and  dynamic  memory.  The 
concurrent simulation techniques of FMOSSIM simultaneously model the good
circuit  and  a  large  number  of  faulty  circuits,  and  consequently  requires 
much  less  CPU  time  than  simple  serial  fault  simulation. 


2.  ARCHITECTURAL  EXPERIMENTS 


We have three architectural experiments, Cosmic Cube, Mosaic, and Super
Mesh, in various phases of design, construction, programming, and use.
These machines are all ensembles of identical, concurrently operating, and
regularly interconnected elements that communicate by message passing
[5102:TR:83]. Our priority in these efforts has been to apply VLSI
technology to achieve substantial advances in cost/performance in a
limited set of computationally demanding tasks.

These  experiments,  the  machine  organizations,  software  systems,  current 
status,  and  application  span,  can  be  summarized  as  follows: 

2.1  Cosmic  Cube 

(W  C  Athas,  Reese  Fawcette,  Mike  Newton,  Chuck  Seitz) 

Cosmic-cube  is  an  experimental  homogeneous  machine  with  elements 
interconnected in a Boolean n-cube. Cosmic elements are of medium size
for  this  class  of  machine,  about  140  MSL,  and  consist  of  an  Intel  8086 
processor  with  8087  floating  point  coprocessor,  128K  bytes  of  primary 
storage,  8K  bytes  of  read-only  storage  for  initialization,  bootstrap,  and 
diagnostic  programs,  and  6  bidirectional  self-timed  communication 
channels. 
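
As an illustration of the Boolean n-cube interconnection (a sketch only, not taken from the cosmic cube software), assume the usual convention that the 64 nodes are labelled with 6-bit binary numbers and that channel k joins two nodes whose labels differ only in bit k. The function name and the channel numbering are assumptions made for this example.

```python
def cube_neighbors(node: int, dimension: int) -> list[int]:
    """Return the labels of the nodes adjacent to `node` in a Boolean n-cube.

    Assumes nodes are labelled 0 .. 2**dimension - 1 and that channel k
    connects labels differing only in bit k (a labelling convention, not
    something specified in the report).
    """
    if not 0 <= node < (1 << dimension):
        raise ValueError("node label out of range for this cube")
    # Flipping bit k of the label gives the neighbor across channel k.
    return [node ^ (1 << k) for k in range(dimension)]


DIMENSION = 6                # the 6-cube described above
NODES = 1 << DIMENSION       # 64 elements

print(NODES, cube_neighbors(5, DIMENSION))
```

With this labelling every node has exactly 6 neighbors, which matches the 6 bidirectional channels per element.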

At  140  million  square  lambda  (MSL)  complexity,  and  78  "off  the  shelf" 
chips,  the  nodes  of  this  machine  are  considered  to  be  a  hardware 
simulation  of  a  node  element  that  could  be  made  as  a  single  chip  with  1 
micron  MOS  technology.  In  anticipation  of  this  advanced  process 
technology,  we  have  built  this  system  in  order  to  experiment  with  the 
applications,  algorithms,  and  programming  of  such  systems. 

2.1.1 Current Status of the Hardware

Construction of the 6-cube (64 element) machine is now complete, and the
system is in regular use. The machine was completed and tested in
stages of 3-, 4-, 5-, and 6-cubes in June, July, August, and September
1983, respectively, as node elements were checked out.

A 2-cube (4 element) prototype has been running concurrent programs since
July 1982, and has been used for software development. We are also
operating an independent 3-cube machine for system software development.

The Cosmic 6-cube elements are running at a clock rate of 4.1 MHz, reduced
from the interim design point of 5.0 MHz, due to speed problems in the
Intel 8087 floating point coprocessor. As soon as all the 8087's are
replaced by the -3 version, the system should operate at 5 MHz. Except for
the 8087's, the system operates at up to 8 MHz. Accordingly, our current
benchmarks can be expected to improve by a factor of as much as 2 over the
next year, as faster 8087's become available and as an improved code
optimizer comes into use.

Under separate support (principally DoE), production of about 200 nodes of
a descendant of the cosmic cube design is under way at Caltech JPL, in
order to provide additional cycles for scientific users in the Caltech
concurrent computation project. These nodes are software compatible with
the cosmic 6-cube, and will be assembled into a Boolean 7-cube (128 element)
system and several smaller systems.


2.1.2  Application  Programs  and  Benchmarks 


An  SU3  lattice  gauge  theory  computation,  an  adaptation  of  a  computation 
that  had  been  run  for  about  1000  hours  on  the  original  2-cube,  is  being 
used  to  test  the  6-cube.  This  program,  an  investigation  of  the  properties 
of  protons  predicted  by  the  quantum  chromodynamics  theory,  has  now  run  for 
about  100  hours,  and  is  producing  successively  more  and  more  refined 
statistics.  It  will  run  for  several  hundred  more  hours  before  improving 
significantly  on  the  best  existing  results  obtained  in  about  40  hours  on  a 
Cray-1.


A Laplace equation demonstration program that displays the relaxation
solution as the computation progresses has been refined into a
highly efficient and general program for differential equation solution by
relaxation methods.


Both  of  these  physics  programs  use  substantially  all  of  the  node  storage, 
and benchmark at 6 to 7 times the VAX-11/780 on the present machine. At an
8 MHz clock and by using a new code optimization package, we expect the
Cosmic 6-cube to achieve closer to 15 times the VAX-11/780 for these regular
computations, or easily in excess of 0.1 of a Cray-1.


There  are  numerous  other  application  programs  under  development,  the  most 
interesting of which is a MOS-VLSI circuit simulator. The formulation
that is used to achieve concurrency is a row partitioning of a modified
nodal admittance matrix into concurrent processes [Mattisson 5096:DF:83].

A  simulator  working  on  this  principle,  written  in  Pascal,  and  running  on  a 
VAX,  is  the  testbed  for  this  program  that  will  be  transferred  to  the 
6-cube  this  spring.  The  simulation  formulation  is  described  in  more 
detail  in  section  4.2. 


2.1.3  Software  Status 


The period of bringing up the 6-cube was one in which a large suite of
testing and diagnostic programs was written and refined. These
programs  are  largely  routine.  The  lowest  level  tests,  such  as  the  RAM 
test,  are  coded  in  8086  assembly  code,  while  the  communication  and 
floating  point  tests  are  coded  in  C. 


The  mature  software  tools  for  application  programming  of  the  Cosmic  Cube 
now include a full initialization, bootstrap, and diagnostic package, a C
and  Unix  based  environment  that  is  widely  used  for  the  more  "crystalline" 
applications,  and  a  complete  Unix  based  simulator  for  programs  written  in 
this  environment. 


A prototype message passing and routing multiple process operating system
called the "cosmic kernel" [5095:DF:83] has been defined and is being
coded and debugged. Since this will be the environment seen by people
who might want to do software experiments with the Cosmic Cube after it
is made available as a network server, an outline of the principal
functions provided by, and the computational model imposed by, the cosmic
kernel (CK) is included in the concurrent computation section 3.2 below.


2.2  Mosaic  Systems 

(Chris  Lutz,  Steve  Rabin,  Chuck  Seitz) 

Mosaic is another experimental homogeneous machine, but with very small
node elements: by the plan of this experiment, a single chip. This
element consists of a mosaic processor with 4 input and 4 output ports
(2.5 MSL) and as much primary storage as a single chip permits. For
example, a 6 mm square chip in 3 micron MOSIS nMOS technology, 4000 lambda
by 4000 lambda, 16 million square lambda (MSL), will accommodate a Mosaic
processor, 4K bytes of RAM (32 bytes of which are "maimed" to provide a
small initialization and bootstrap loader), and the small number of pads
required for this element.

Mosaic elements can be interconnected in a variety of communication plans,
including a tree, mesh, shuffle-exchange graph, chordal ring, or cube
connected cycle.

A paper on the design of the Mosaic element, to be published in the
Proceedings of the MIT Conference on Advanced Research in VLSI, January
1984, is included as Appendix A of this report, and is also available as a
Caltech technical report [5093:TR:83].

MOSIS has fabricated a run of 48 prototype Mosaic processors for us, on
which we got a 56% yield (27 working processors). These processors are
being packaged on PCBs (designed with Earl and fabricated through the
MOSIS prototype PCB service) with fast (InMDS) off-chip storage in order
to make a working, programmable, and expandable 15-element Mosaic-tree or
16-element Mosaic-shuffle. These machines will be used for software
development while we go through the logistics of building larger and more
highly integrated versions of Mosaic systems. This staging tactic worked
very well for the cosmic cube project.

An improved version of the processor with interrupts, a multiply
instruction, a faster control PLA, and new microcode has been designed,
but  not  yet  verified.  RAM  test  chips  have  been  fabricated  and  tested.  A 
full-size  RAM  element  has  been  designed  and  verified,  and  is  about  ready 
to  be  sent  to  MOSIS  for  fabrication  and  test.  Thus  we  believe  we  are  very 
close  to  assembling  a  complete  Mosaic  element,  a  single  chip  with  about 
140,000  transistors. 

This project involves a number of supporting efforts in testing to allow
production of these chips in quantities of several thousands. We are
putting a wafer-stepping probe station into operation in preparation for
testing Mosaic elements on the wafers.


It is our intention in the runs of Mosaic element chips, both in an early
run of about 20 wafers, and in a later run of sufficient wafers to yield
about 1500 working chips, to work with Martin Buehler at JPL in
correlating test strip results die by die with functional test results.

It  is  our  plan  to  have  a  1024-element  Mosaic  system  running  in  June  1985. 

A 1024-element Mosaic system is expected to be capable of 2,500 million
instructions per second on combinatorial problems, or of 20:80 million
32-bit mantissa floating point operations per second (essentially Cray-1
performance) on a limited class of matrix, grid point, and finite
element computations. It is expected to exhibit a factor of about 10 in
cost/performance  over  some  of  the  most  regular  computations  that  can  be 
performed  on  cosmic  cube,  and  that  do  not  require  large  amounts  of  storage 
per  node,  and  a  factor  of  about  100  in  cost/performance  over  conventional 
mainframes  for  this  limited  set  of  problems. 
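
To put the system-level figures in per-element terms, the following sketch simply divides the quoted totals by the number of elements. The per-element numbers are derived here for illustration and do not appear in the report; the variable names are hypothetical.

```python
ELEMENTS = 1024                      # planned Mosaic system size
SYSTEM_MIPS = 2500.0                 # quoted rate on combinatorial problems
SYSTEM_MFLOPS_RANGE = (20.0, 80.0)   # quoted floating point range

# Per-element instruction rate, in millions of instructions per second.
mips_per_element = SYSTEM_MIPS / ELEMENTS

# Per-element floating point rate, in thousands of operations per second.
kflops_per_element = tuple(1000.0 * m / ELEMENTS for m in SYSTEM_MFLOPS_RANGE)

print(f"{mips_per_element:.2f} MIPS per element")
print(f"{kflops_per_element[0]:.1f} to {kflops_per_element[1]:.1f} Kflops per element")
```

Under these figures each element would contribute roughly 2.4 million instructions per second, so the system performance comes from the number of elements rather than from the speed of any one of them.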

Discussions  of  algorithms  and  programming  systems  for  Mosaic  are  deferred 
to  the  concurrent  computation  section  below. 

2.3 Super-mesh

(Wen  King  Su,  Chuck  Seitz) 

Super-mesh  is  a  serial  communication,  serial  floating  point  arithmetic, 
SIMD  machine  in  the  early  stages  of  design.  Its  rationale  was  discussed 
at some length in our previous semiannual report [5078:TR:83]. This
machine  might  be  regarded  as  a  shared  control  implementation  of  a 
computational  or  systolic  array. 

The  arithmetic  algorithms  and  logic  design  for  the  super-mesh  node  are  now 
complete, and the node has been fully simulated with MOSSIM.

The  most  substantial  change  made  in  the  course  of  the  design  from  the 
plans  previously  reported  is  a  decision  to  use  a  64-bit  floating  point 
format with a 56-bit mantissa and 8-bit exponent.

Based  on  early  and  partial  layouts  of  the  arithmetic  slice  and  registers, 
the  elements  meet  our  previous  size  expectations,  as  scaled  by  the  change 
in  word  size,  to  be  about  2  MSL.  Each  element  contains  40  registers, 
serial  floating  point  arithmetic,  neighbor  communication,  and  the  serial 
microcode  receiver  and  pipeline.  An  instruction  cycle  of  this  machine 
requires  65  clock  cycles.  Since  the  serial  carry-save  arithmetic 
algorithms use only short combinational paths, we expect to clock this
chip  at  20  MHz,  and  achieve  a  floating  point  rate  of  0.3  Mflops  per 
element  or  1.2  Mflops  per  chip  with  4  elements  per  chip. 
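
The quoted floating point rates follow from the clock rate and the instruction cycle length. The sketch below reproduces that arithmetic under the assumption, implied but not stated above, that one floating point operation completes per 65-clock instruction cycle.

```python
CLOCK_HZ = 20e6               # expected clock rate
CLOCKS_PER_INSTRUCTION = 65   # clock cycles per instruction cycle
ELEMENTS_PER_CHIP = 4


def mflops_per_element(clock_hz: float, clocks_per_instruction: int) -> float:
    # Assumes one floating point operation per instruction cycle.
    return clock_hz / clocks_per_instruction / 1e6


per_element = mflops_per_element(CLOCK_HZ, CLOCKS_PER_INSTRUCTION)
per_chip = per_element * ELEMENTS_PER_CHIP
print(f"{per_element:.2f} Mflops per element, {per_chip:.2f} Mflops per chip")
```

The results, about 0.31 and 1.23, round to the 0.3 and 1.2 Mflops figures given in the text.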

A  microcode  control  word  is  transmitted  serially  for  each  instruction 
cycle.  The  physical  design  of  this  machine  employs  a  deliberate  skew  in 
the internode communication and instruction broadcast to allow it to be
extended  to  any  size,  but  its  interconnection  is  limited  to  a  mesh. 


2.4 Designs for Advanced Technology Homogeneous Machines
(Chuck  Seitz) 

A number of designs for an advanced technology homogeneous machine node
element are being developed, with the following characteristics:

(1) 1 Mbyte of storage per node, with error correction, implemented with
40 256K dRAM chips.

(2) Two 32-bit processors, one for communication and operating system
functions, and the second for task processing (including fast floating
point arithmetic), share the Mbyte of storage. Either the M68010 or the DEC
microVAX is a possibility for the processors.

(3) The communication section will be based on an evolution of the FIFO
Buffered Transceiver (FBT) chip previously reported [Ng 5055:TR:82]. The
node will support up to 12 serial channels, which would allow up to 4096
element Boolean n-cube machines, and an additional channel for host, I/O,
or secondary storage connections.


(4)  The  task  section  will  use  a  floating  point  coprocessor  with  a  floating 
point  rate  of  1  Mflop  with  "short"  floating  point  words. 


3.  CONCURRENT  COMPUTATION 


3.1  Concurrent  Algorithms 
(Lennart Johnsson)

In  the  search  for  efficient  algorithms  for  ensemble  architectures,  a  few 
algorithms  for  sorting  on  binary  trees  and  Boolean  n-cubes,  and  for 
solving tridiagonal linear systems of equations on n-cubes have been
devised [5085:DF:83]. The algorithms are totally distributed (as are the
data  structures). 

Bitonic sort can be performed on a perfect shuffle network in
(logN)*(logN) time, if there is one element per node. With one element
per node in a Boolean n-cube the order of the time complexity is the same,
but the constant can be improved somewhat. With several elements per node
a combination of sequential sort and parallel sort is necessary.
A few algorithms have been devised for different sorting orders, and with
different combinations of sequential/parallel sort. Some of the
algorithms exhibit a gradual change of behavior from an efficient sequential
sort when only one processor is available to a bitonic sort when there are
as many processors as elements to be sorted. All nodes execute the same
program. The control is entirely local.
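
The one-element-per-node case can be illustrated with a sequential simulation of the textbook bitonic sort on a Boolean n-cube. This is a sketch of the standard algorithm, not the algorithms of [5085:DF:83]; the function name is hypothetical, and the inner loop over nodes stands in for compare-exchange operations that would run concurrently across one cube channel.

```python
def bitonic_sort_on_cube(values):
    """Simulate bitonic sort with one element per node of a Boolean n-cube.

    Returns the sorted list and the number of parallel compare-exchange
    steps, each of which uses a single cube channel.
    """
    n_nodes = len(values)
    dimension = n_nodes.bit_length() - 1
    if n_nodes < 1 or n_nodes != 1 << dimension:
        raise ValueError("number of nodes must be a power of two")
    values = list(values)
    steps = 0
    for stage in range(dimension):
        # Stage `stage` merges bitonic runs of length 2**(stage + 1).
        for channel in range(stage, -1, -1):
            for node in range(n_nodes):
                partner = node ^ (1 << channel)
                if node < partner:
                    # Bit (stage + 1) of the label selects the sort direction.
                    ascending = ((node >> (stage + 1)) & 1) == 0
                    a, b = values[node], values[partner]
                    out_of_order = (a > b) if ascending else (a < b)
                    if out_of_order:
                        values[node], values[partner] = b, a
            steps += 1
    return values, steps


result, steps = bitonic_sort_on_cube([7, 3, 6, 0, 5, 1, 4, 2])
print(result, steps)
```

For a cube of dimension d the simulation takes d(d+1)/2 parallel steps, which is the (logN)*(logN) order of growth mentioned above; each node needs only its own label and the step counters, so the control is local.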

The tridiagonal system solver devised for the n-cube solves the system in
logN time if the cube is large enough that one dimension of the system to
be solved can be identified with one node of the cube. If the system is
larger than that, then the execution time grows linearly, as on a
sequential machine. The control is again entirely local, but somewhat
more complex than in the sorting algorithms. The local control sequence
can be derived from generators of a Gray code. What is effectively needed
is to make successive linear embeddings in the cube, where the nodes in
each embedding consist of nearest neighbors in the cube, and the
succession of paths to be embedded is obtained by deleting every other
node in the previous path.
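
The role of a Gray code in embedding a linear path can be seen in a small sketch. It uses the binary-reflected Gray code, which is an assumption; the solver in [5085:DF:83] is not reproduced here, and the helper names are hypothetical.

```python
def gray(i: int) -> int:
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def hamming(a: int, b: int) -> int:
    """Number of cube channels that must be crossed between labels a and b."""
    return bin(a ^ b).count("1")


def gray_path(dimension: int) -> list[int]:
    """Cube node labels visited by a linear path through all 2**dimension nodes."""
    return [gray(i) for i in range(1 << dimension)]


path = gray_path(3)
print(path)

# Successive nodes on the path are nearest neighbors in the cube.
print([hamming(a, b) for a, b in zip(path, path[1:])])

# After deleting every other node, the survivors are two channels apart
# if they stay in place, so the shorter path has to be embedded again.
reduced = path[::2]
print(reduced, [hamming(a, b) for a, b in zip(reduced, reduced[1:])])
```

The second list shows why a single embedding is not enough: the reduced path is no longer made of nearest neighbors, which is consistent with the need for successive embeddings described above.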


3.2  Cosmic  Kernel 

(W  C  Athas,  Reese  Fawcette,  Chuck  Seitz) 

The  following  is  an  outline  of  the  principal  functions  provided  by  and 
computational model imposed by the cosmic kernel (CK), a small operating
system  kernel  being  developed  for  the  cosmic  cube. 

One  copy  of  the  cosmic  kernel  (CK)  resides  in  each  node  of  the  cosmic  cube 
(CC),  and  all  of  these  copies  are  concurrently  executable.  Some  operating 
system  functions  are  supported  also  in  the  CC  intermediate  host  (IH). 

When running with this operating system, the IH does not run user code.
It is dedicated to operating system and network functions, and the CC
operates as a network server.


The kernel has two layers. All those parts of the kernel with which one
communicates by system calls are in the "inner kernel" (IK). The inner
kernel contains all message sending and receiving, message routing, and
process scheduling.

All other kernel functions are invoked not by a system call mechanism, but
by sending messages to a set of processes called the "outer kernel" (OK).
The  outer  kernel  provides  capabilities  of  host  I/O  and  of  process  creation 
and  destruction. 

The  basic  unit  of  the  computations  supported  by  CK  is  a  "process."  A 
single  node  of  CC  may  contain  many  processes.  A  computation  consists  of  a 
collection  of  processes  distributed  through  the  CC  that  can  be  thought  of 
as  all  executing  concurrently,  either  by  virtue  of  being  in  different 
physical nodes of CC, or by being interleaved in execution within a single
node.  Processes  communicate  by  sending  and  receiving  messages. 

The  placement  of  processes  in  physical  nodes  of  CC  can  be  controlled  by 
the  programmer,  or  may  be  deferred  to  a  library  process.  This  placement 
does not influence the logic of the program, but will have consequences in
(1) the possibility of exceeding available storage, and (2) performance,
through the overhead in message routing and the competition for cycles
amongst processes in a single node.

As far as the kernel is concerned, a process is a segment of sequential
code and data of fixed size. This code and data are represented for
communication purposes as a binary image relocatable by the segmentation
features of the 8086 processor. Process code must be dynamically
relocatable; it must not load or manipulate the code segment (CS), data
segment (DS), or stack segment (SS) registers, and must maintain a stack
with sufficient space for storing state in an interrupt. The code, data,
and stack segments are each limited to 64K bytes.

The code for a process is written in a suitable programming notation, such
as extended versions of Pascal, C, or 8086 assembly, and compiled
independently of other processes that may be a part of the same
computation. Because of the independence of the construction of process
code, and the standardization of kernel functions, there is complete and
uniform compatibility of processes independent of source language. For
example, a library process written in C or assembly code can be used in a
computation in which most of the processes were written in Pascal.

Each process has a unique 16-bit identifier that is an ordered pair:
process id = <physical node, process number within the node>. This id is
normally represented in a single 16-bit word, in which the physical node
has a range 0:255 and the process number a range 0:255.
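
One way such an id could be packed into a 16-bit word is sketched below. The text gives only the two ranges, so the placement of the physical node in the high byte and the process number in the low byte is an assumption, as are the function names.

```python
def pack_process_id(node: int, number: int) -> int:
    """Pack <physical node, process number> into one 16-bit word.

    Assumed layout (not specified in the report): node in the high byte,
    process number in the low byte.
    """
    if not (0 <= node <= 255 and 0 <= number <= 255):
        raise ValueError("node and process number must each be in 0:255")
    return (node << 8) | number


def unpack_process_id(pid: int) -> tuple[int, int]:
    """Recover <physical node, process number> from a packed id."""
    if not 0 <= pid <= 0xFFFF:
        raise ValueError("process id must fit in 16 bits")
    return pid >> 8, pid & 0xFF


pid = pack_process_id(3, 17)
print(hex(pid), unpack_process_id(pid))
```

Because the node number can be read directly from the word, a router needs no table lookup to decide where a message for this process should go.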

Because the physical location of a process is embedded in its id, CK does
not maintain a map from process id to physical node. Message routing to
processes is based simply on the physical address part of the destination
found in the message header. Thus we assume that a process, once created,
is not relocated.

CK supports one message format. All messages have headers. Long messages
are  communicated  over  the  physical  channels  of  CC  by  different  protocols 
than  are  used  for  short  messages,  but  this  difference  is  invisible  to  user 
programs. 

The  message  header  can  be  thought  of  as  the  envelope  in  which  a  message  is 
sent. The header is 4 words long, and contains

    word 1:  id of the destination process
         2:  id of the sending process
         3:  message type
         4:  message length

Words  1  and  2  are  self-explanatory.  The  16-bit  type  is  significant  in  the 
way  in  which  messages  sent  are  matched  against  messages  expected  by  the 
destination  process.  The  message  length  is  specified  in  number  of  words, 
and  may  be  zero.  A  message  of  length  0  is  called  a  "synchronization 
message," and conveys information only through the type and by its
existence. 
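
The four-word header can be modelled as a small record. In the sketch below the field names are hypothetical, and the byte order (little-endian 16-bit words, as native to the 8086) is an assumption about the representation rather than something the text specifies.

```python
import struct
from typing import NamedTuple


class Header(NamedTuple):
    destination: int    # word 1: id of the destination process
    sender: int         # word 2: id of the sending process
    msg_type: int       # word 3: message type
    length_words: int   # word 4: message length in words, may be zero


# Four unsigned 16-bit words; little-endian order is assumed here.
HEADER_FORMAT = "<4H"


def encode(header: Header) -> bytes:
    return struct.pack(HEADER_FORMAT, *header)


def decode(raw: bytes) -> Header:
    return Header(*struct.unpack(HEADER_FORMAT, raw))


def is_synchronization(header: Header) -> bool:
    # A message of length 0 carries information only in its type.
    return header.length_words == 0


sync = Header(destination=0x0311, sender=0x0102, msg_type=7, length_words=0)
raw = encode(sync)
print(len(raw), decode(raw), is_synchronization(sync))
```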

A  process  performs  message  sending  and  receiving  by  system  calls.  The 
system call is implemented on the cosmic node as a software interrupt.

For  present  purposes  calls  will  be  described  by  a  name  followed  by 
parameters,  if  any.  In  C  or  Pascal  source,  system  calls  may  be  either 
calls  to  an  external  procedure  that  includes  the  system  call,  or  may  be 
compiled  directly  into  the  suitable  system  call. 

The  two  basic  message  calls  are  SEND  and  RECV.  These  calls  specify  a 
communication  request  that  will  be  satisfied  as  soon  as  CK  is  able.  CK 
buffers  messages  in  transit,  up  to  and  including  complete  messages.  It  is 
perfectly  legal  to  SEND  a  message  before  a  corresponding  RECV  is  executed; 
such  a  message  will  be  queued  in  the  destination  node  and  in  transit  as 
storage  space  allows.  CK  also  supports  a  PROBE  call  that  checks  for  the 
presence  of  a  message  queued  for  a  process,  and  an  UNRECV  call  that 
will  undo  a  pending  RECV.  Details  of  these  functions  are  described  in 
[5095:DF:83] . 
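
The queuing behavior described above can be illustrated with a toy model of the message queue of a single process. This is not the CK implementation: the class, the method signatures, the return values, and the rule that a RECV is matched on message type alone are all assumptions made for the sketch, and the actual semantics are those defined in [5095:DF:83].

```python
class ToyProcessQueue:
    """Toy model of messages queued for one destination process."""

    def __init__(self):
        self.queued = []     # (msg_type, payload) that arrived before a RECV
        self.pending = []    # message types with an outstanding RECV
        self.delivered = []  # messages handed over to the process

    def send(self, msg_type, payload):
        """A message arrives: satisfy a pending RECV or queue the message."""
        if msg_type in self.pending:
            self.pending.remove(msg_type)
            self.delivered.append((msg_type, payload))
        else:
            self.queued.append((msg_type, payload))

    def recv(self, msg_type):
        """Request a message; returns True if it was satisfied at once."""
        for index, (queued_type, payload) in enumerate(self.queued):
            if queued_type == msg_type:
                del self.queued[index]
                self.delivered.append((msg_type, payload))
                return True
        self.pending.append(msg_type)
        return False

    def probe(self, msg_type):
        """Check for a queued message without consuming it."""
        return any(queued_type == msg_type for queued_type, _ in self.queued)

    def unrecv(self, msg_type):
        """Undo a pending RECV; returns True if one was outstanding."""
        if msg_type in self.pending:
            self.pending.remove(msg_type)
            return True
        return False


q = ToyProcessQueue()
q.send(7, "early")          # SEND before the corresponding RECV
print(q.probe(7), q.recv(7), q.delivered)
```

In the usage lines the message sent before any RECV is queued, PROBE reports its presence, and the later RECV is satisfied immediately from the queue.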

The outer kernel (OK) is a privileged set of processes whose functions
are invoked by messages rather than system calls. User processes will
normally  communicate  with  their  own  OK  processes,  but  may  equally 
communicate  with  any  other  OK  processes. 

If we might be allowed to indulge briefly here in design philosophy, let
us remark that any future evolution of CK is seen as occurring in the
outer kernel. The number of functions of the inner kernel, and accordingly
its size, complexity, and difficulty in portability, has been guarded
fairly closely. These functions are very close to machine intrinsics, and
may  well  guide  the  development  of  future  node  architectures.  The  outer 
kernel  is  meant  to  be  more  nearly  open-ended,  machine  independent,  and 
accessible  to  change. 


The set of process ids <*,255>, <*,254>, ... is reserved for the OK.
Processes <*,254> have capabilities of performing I/O with hosts, and so
are denoted <*,host>. Processes <*,255> have capabilities of creating and
destroying processes, and the associated capability of performing storage
management within their own node, and are denoted <*,spawn>.

3.3  Mosaic  Software 

3.3.1  Scheduler 

(Pey-yun  Peggy  Li,  Lennart  Johnsson) 

A scheduler has been implemented in Mosaic assembly language and tested
under the simulator. The scheduler keeps track of the states of the
processes, i.e., the RUN, READY, SUSPEND, and SLEEP states. State
transitions are triggered either by the running process when an I/O
operation fails or by the scheduler when it receives a message at an
input port. The scheduler occupies 387 words of memory and one Process
Control Block takes 30 words. The time to perform a context switch (swap
in and swap out) and inspect all three input ports is about 150
instruction cycles.

3.3.2  Tree  Downloader 

(Pey-yun  Peggy  Li,  Lennart  Johnsson) 

A multi-node downloader for a Mosaic-tree, incorporating the mapping
algorithm [5084:TR:83], has been written and tested. The
downloader can load a fixed number of process programs into one Mosaic
element. That number is furnished by the host and propagated through the
entire tree. The downloader has three parts: initiation, type loading, and
name loading. The initiation part creates, allocates, and initiates the
process control blocks for the fixed number of processes.

The type loading part loads the proper number of node types into each
node, and the code loading part loads the relocatable program code of all
the resident processes into each node's memory. The type loading part
runs in time sharing mode for simplicity. Because of the heavy
context switching, the root processor takes about 13,000 instruction cycles
to load the node type string of a five level tree into a four level tree
machine. Meanwhile, it takes about 10,000 instruction cycles to load all
the program code down into the tree, provided that the root node contains
two different node types, there are five different node types in total,
one for each level, and each program is 200 words long.

For an L level binary tree which is mapped onto an M level tree machine,
L > M, the time to load the node type string at the root processor can be
formulated as follows:

$$T = 7000 + \left\{ 2 \sum_{i=4}^{n+1} \left[ 2^{i-1} - 2^{i-3} - 1 \right] \right\} \times 1000 + N \times 1000 \times (L - n - 1)$$

where $n = M - L$ and $N = 2^{n} - 1$, i.e., the number of nodes shared in the
root processor.

The  documentation  for  the  scheduler  and  the  multi-node  downloader  is 
in  preparation. 

3.3.3  A  Modic  Compiler  for  Mosaic 
(Alain  Martin) 

We  are  building  a  compiler  for  translating  a  high-level  language  for 
distributed  computations  into  Mosaic  code. 

According  to  the  principle  that  "a  complex  system  that  works  is  invariably 
found  to  have  evolved  from  a  simple  system  that  worked"  (John  Gall),  we 
have  decided  to  start  with  a  simple  language  called  Modic.  The  sequential 
part of the language is based on Dijkstra's guarded commands.

Communication  and  synchronization  are  provided  by  input  and  output 
commands  (similar  to  CSP's)  on  channels.  A  channel  is  a  programming 
concept  that  makes  it  possible  to  match  an  input  command  in  one  process  to 
an  output  command  in  another  process.  A  Boolean  operation  on  a  channel, 
called the "probe", allows one to test whether an input or output command
is  pending  on  the  channel. 

Later, the language will be extended with procedures, dynamic creation of
processes, multiple channels (i.e., channels shared by more than two
processes), select, and broadcast operations.


4.  VLSI  DESIGN 


4.1  Switch  Simulation  Tools 

4.1.1  FMOSSIM 

(Mike  Schuster,  Randy  Bryant) 

FMOSSIM, a fault simulator for MOS digital systems, first became
operational in April 1983. This program uses the same switch-level
representation of MOS circuits as the logic simulator MOSSIM II. As a
consequence, it can model such MOS circuit structures as (bidirectional)
pass transistors, static and precharged logic, busses, and both static and
dynamic memory. Faults are represented as alterations of the switch-level
description causing selected nodes to be stuck-at 0 or 1, or selected
transistors to be stuck open or closed. Faults such as breaks in wires or
short circuits between wires can also be modeled by adding extra "fault"
transistors to the network description.

This  combination  of  circuit  and  fault  modeling  capabilities  is  far  more 
general  than  has  been  achieved  previously.  FMOSSIM  utilizes  concurrent 
simulation  techniques  to  simultaneously  model  the  good  circuit  and  a  large 
number  of  faulty  circuits,  and  consequently  requires  much  less  CPU  time 
than  simple  serial  fault  simulation.  Both  the  utility  and  the  performance 
of  this  program  seem  quite  promising. 

A  paper  on  FMOSSIM,  to  be  published  in  the  Proceedings  of  the  MIT 
Conference  on  Advanced  Research  in  VLSI,  January  1984,  is  included  as 
Appendix  B  of  this  report,  and  is  also  available  as  a  Caltech  technical 
report [5101:TR:83].

4.1.2  The  MOSSIM  Simulation  Engine 
(Bill  Dally,  Randy  Bryant) 

We have begun the design of the MOSSIM Simulation Engine (MSE), a special
purpose processor for performing switch level simulation of MOS VLSI
circuits [5100:TR:83]. A single MSE processor will be constructed from
about 400 TTL MSI and MOS memory devices packaged on a single 15" x 15"
wire-wrap board, and will perform switch level simulation at a rate of
about 5*10**5 logic events per second, 500 times faster than MOSSIM II
running on a VAX-11/780. Several MSE processors may be connected in
parallel to achieve additional speedup.

4.2  Circuit  Simulation  on  the  Cosmic  Cube 
(Sven Mattisson, Lennart Johnsson, Chuck Seitz)

We have completed a study of MOS-VLSI circuit simulation formulations for
concurrent execution [5096:DF:83].


This effort is motivated by two threads of our research. First, SPICE
uses a great deal of time on our computers as well as on many people's
supercomputers, and such a program would provide a more economical way of
doing these simulations. Second, circuit simulation is an excellent
paradigm of a computation with concurrency opportunities in which the
communication graph, while fixed, is not so regular as in the
computations being done by our collaborators in the sciences at Caltech.

The  formulation  we  have  chosen,  a  modified  nodal  admittance  matrix,  is  not 
as  general  as  the  usual  circuit  simulator,  and  will  not  treat  elements 
such  as  ideal  opamps,  current  controlled  current  sources,  ideal 
transformers,  or  nonlinear  elements  that  lack  an  unambiguous  admittance 
description.  This  formulation  is,  however,  perfectly  adequate  for 
MOS-VLSI,  and  offers  some  storage  and  performance  advantages  over  more 
general  formulations. 

The most critical aspect of the approach to concurrent execution is the
sparse matrix equation solution method. Direct solution methods are hard
to implement on a machine such as the cosmic cube, while iterative methods
are very natural. The most common iterative methods are the Jacobi (J),
Gauss-Seidel (GS), successive overrelaxation (SOR), and conjugate
gradient (CG) methods.

Among these basic iterative methods, J, GS, and SOR use only one row at a
time in the matrix to calculate a new estimate for each component in the
unknown vector. Thus there is both locality and concurrency achieved by
row partitioning. SOR requires an accurate estimate of the relaxation
parameter to be efficient, and its local calculation is sufficiently
difficult to eliminate SOR from consideration. CG is not strictly
iterative, and because it uses matrix inner products during each
iteration, it is less local than J or GS.
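
The row-at-a-time character of J and GS can be seen in a small sketch.
The code below is illustrative only: the function names and the 3 by 3
diagonally dominant test system are our own assumptions, not part of the
simulator. In the Jacobi step every row uses only the previous estimate,
so rows can be updated concurrently; in the Gauss-Seidel step each row
uses the newest values already computed.

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep: every row uses only the old estimate x."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep: rows are updated in place, in order."""
    x = list(x)
    n = len(b)
    for i in range(n):
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
    return x


def solve(step, A, b, tol=1e-10, max_iter=1000):
    """Iterate until successive estimates differ by less than tol."""
    x = [0.0] * len(b)
    for k in range(1, max_iter + 1):
        x_new = step(A, b, x)
        if max(abs(p - q) for p, q in zip(x_new, x)) < tol:
            return x_new, k
        x = x_new
    raise RuntimeError("no convergence")


# Small diagonally dominant example whose exact solution is [1, 1, 1].
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x_j, iterations_j = solve(jacobi_step, A, b)
x_gs, iterations_gs = solve(gauss_seidel_step, A, b)
```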

When choosing a method for solving the matrix equation, one must also
weigh the fact that matrix solving on sequential machines represents only
10-20% of the cycles, while the more freely concurrent model equation
evaluation represents 80-90% of the cycles. One prefers that the
partitioned matrix equation solving algorithm have an interface to the
model evaluating routines that causes as little communication and
redundant model evaluation as possible. There is a good correspondence
between the elements and the row partitioning: each row in the modified
nodal admittance representation contains only entries from devices
connected to the node represented by the row.
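
The locality of each row can be illustrated by stamping a few
two-terminal conductances into a nodal admittance matrix. This is a
sketch under our own assumptions: the function names and the element list
are hypothetical, node 0 is taken as ground, and only linear conductances
are shown.

```python
from collections import defaultdict


def stamp_conductance(Y, a, b, g):
    """Add a conductance g between nodes a and b (node 0 is ground)."""
    for p, q in ((a, b), (b, a)):
        if p != 0:
            Y[p][p] += g
            if q != 0:
                Y[p][q] -= g


def row_neighbors(Y, node):
    """Nodes other than `node` that appear in the row for `node`."""
    return sorted(k for k in Y[node] if k != node)


Y = defaultdict(lambda: defaultdict(float))
elements = [(1, 2, 1e-3), (2, 3, 2e-3), (3, 0, 5e-4)]  # (node, node, siemens)
for a, b, g in elements:
    stamp_conductance(Y, a, b, g)
```

The row for node 2 contains entries only for nodes 1 and 3, the nodes
that share a device with it, so a process that owns that row needs values
only from the owners of those two nodes.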

In order to see if iterative methods are considerably slower than direct
ones, SPICE 2 version G.5 was modified to use GS iterations in the
transient analysis. GS was used instead of Jacobi, since it does not need
extra storage for the new estimates of the unknowns and is easier to
code; since GS and J have similar convergence properties, the results
should apply to both methods. Three different test circuits were used,
with the consistent result that the performance was nearly identical.

Thus, based on these studies, the circuit simulator we are writing for the
cosmic cube will use a row partitioned matrix equation solving routine.
Among the iterative solution methods, the Jacobi iteration assures
convergence independent of the order in which the different processes
complete their calculations (for matrices conditioned in a way that is
easily assured for MOS circuits). Within each process the modified
Newton-Raphson method is used as the inner iteration of the global Jacobi
iteration. A first or second order predictor-corrector backward
differentiation formula is assumed as the integration algorithm.


4.3  From  Circuit  to  Layout  -  Another  Approach 
(Tak-Kwong  Ng,  Lennart  Johnsson) 

The circuit embedding problem can be transformed into the problem of graph
embedding. A proper graph model for studying MOS circuit layout topology
has been proposed, and an algorithm for mapping a circuit into its graph
model has been implemented.

Transistors whose gates belong to the same net are lumped together to form
one component. This component is represented by a vertex. Each
individual source or drain is mapped into an edge incident on this vertex.
All other connections to the gate net are represented by edges incident on
this vertex.

Serial transistor configurations can be considered as a transistor with
multiple gates. Such a component is also represented by a vertex. The
source and drain are mapped into edges incident on this vertex. All gate
connections are represented by edges incident on this vertex.

The  exterior  of  a  layout  is  also  mapped  into  a  vertex.  For  each  external 
net,  there  is  an  edge  connecting  the  exterior  vertex  and  the  external  net 
vertex. 
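
A much simplified version of this mapping can be sketched as follows.
The sketch rests on assumptions that the description above does not
settle: every net is given a vertex, the transistors gated by a net are
lumped into that net's vertex, each source or drain edge runs to the
vertex of the net it touches, and serial configurations are not handled.
The function name and the vertex name "EXTERIOR" are hypothetical.

```python
def circuit_to_graph(transistors, external_nets):
    """Map a transistor list to (vertices, edges).

    transistors: list of (gate_net, source_net, drain_net) tuples.
    external_nets: nets that reach the exterior of the layout.
    The edge list is a multigraph; parallel edges are kept.
    """
    vertices = {"EXTERIOR"}
    edges = []
    for gate, source, drain in transistors:
        vertices.update((gate, source, drain))
        edges.append((gate, source))  # one edge per source terminal
        edges.append((gate, drain))   # one edge per drain terminal
    for net in external_nets:
        vertices.add(net)
        edges.append(("EXTERIOR", net))
    return vertices, edges


# Two pulldown transistors sharing an output, as in a NOR gate.
pulldowns = [("a", "gnd", "out"), ("b", "gnd", "out")]
vertices, edges = circuit_to_graph(pulldowns, ["a", "b", "out", "gnd"])
```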

Hopcroft and Tarjan's graph planarity testing algorithm is modified to
include the procedure for saving the appropriate information which is
necessary for deriving the proper embedding. The information is saved as
constraint graphs. Thus, the graph embedding problem becomes the problem
of assigning edges to some plane such that all constraints are met, and
a predefined function is optimized. Optimal solutions can be found by
enumeration. Several heuristic approaches are being studied.

4.4  Mosaic  RAM  element  design 
(Steve  Rabin,  Chris  Lutz,  Chuck  Seitz) 

A  RAM  section  for  the  Mosaic  element  has  been  designed.  Each  RAM  section 
stores  4K  bits,  and  8  copies  of  this  macrocell  will  be  used  in  the  16  MSL 
version  of  the  Mosaic  element. 

Although specialized processes provide higher storage density than
processes that are suitable for the Mosaic processor, a processor
with on-chip memory has many advantages over processor and memory in
separate packages. These advantages include many times lower volume, pin
count, signal energy, and driver delay, resulting primarily from the
integration of the memory bus into a single package.

The  Mosaic  memory  was  designed  to  be  quite  process  technology  independent, 
and  able  to  benefit  from  nMOS  technology  to  under  2um  feature  size.  It 
was  also  designed  to  take  full  advantage  of  and  extend  the  circuit  style 
of  "hot  clock"  bootstrap  drivers  used  in  the  Mosaic  processor.  The  final 
design  uses  exactly  no  depletion  transistors. 

The  memory  must  be  very  fast,  fast  enough  to  perform  an  access  at  least 
every  300  tau,  and  have  a  16-bit  word  interface.  The  memory  must  have  a 
negligible  soft  error  rate,  and  would  ideally  dissipate  negligible  DC 
power.  The  memory  must  be  flexible  enough  to  be  configured  in  various 
sizes  of  up  to  the  4K  words  (64K  bits)  of  processor  address  space. 

Process independence and organizational simplicity led us to select a
two-bus three-transistor dynamic RAM memory cell design.

Commercial single transistor dynamic memories require dynamic node refresh
every 2 ms. Systems using such devices typically use error
detecting/correcting codes to bring soft errors to acceptable levels. It
is expected that the use of many banks of concurrently refreshing three
transistor cells with fairly large storage nodes, combined with the
50 usec refresh period provided by the Mosaic processor, will provide very
good immunity to soft errors.

The  memory  bus  is  pipelined  to  take  advantage  of  the  rather  generous 
memory  latency  permitted  by  the  processor.  Allowable  instruction  latency 
is  one  cycle  plus  one  phase.  Allowable  data  or  write  latency  is  two 
complete  cycles.  Pipelining  in  this  fashion  allows  us  to  reduce  the 
bandwidth  to/from  the  processor  to  29  bits/cycle  (1  write,  12  address,  16 
data). 

Each  memory  access  starts  with  a  column  read  followed  almost  always  by  a 
write  to  the  same  column  on  the  next  cycle.  Because  this  write  occurs  in 
the  same  cycle  as  the  next  read,  the  storage  control  is  somewhat  tricky. 
Mandating  a  write  causes  one  of  several  words  from  the  memory  to  be 
replaced  with  write  data  from  the  processor.  If  another  word  on  the  same 
column  is  accessed  on  that  cycle,  the  subsequent  write  back  to  the  column 
would  permanently  store  incorrect  (stale)  data.  For  this  reason  a  write 
back  override  is  enabled  on  the  second  cycle  after  a  write  cycle. 

Two subsidiary clock phases, phi1L and phi2L, have been introduced, and
divide the processor cycle into 6 epochs. A conservative clocking
discipline is used to permit switch level simulation using a unit delay
timing model.

Circuit design for the Mosaic storage element avoids depletion loads
because of the associated scaling and process problems, which are
incompatible with the goals of aggressive technology independence and low
DC power consumption. Dynamic logic is used extensively. Race conditions
are avoided entirely and charge sharing is kept to a minimum. Large
capacitive loads are admirably driven by rising edge logic using bootstrap
clock drivers combined with precharged logic devices, typically underneath
the memory bus itself. Circuit simulations predict operation at 15 MHz to
match the predicted performance of 3 micron Mosaic processors.

So, our objectives have been met by a conceptually simple, scalable
memory, optimized for advancing MOS process technology. This memory
design imposes the following domain restrictions:

(1) Consecutive write operations are not supported.

(2) Read operations immediately following a write operation will not
refresh the dynamic storage nodes of the column so read.

(3) Reading a word that was written the previous cycle is not supported.

The first two conditions do not occur in the microcode, and the third
condition occurs only by writing into the instruction stream.
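
The two unsupported access patterns can be expressed as a simple check
over a sequence of memory cycles. This is only a sketch: the function
name and the ("R", address) / ("W", address) encoding are our own
assumptions, and the remaining restriction (a read immediately after a
write does not refresh its column) is a property of the access and is
not flagged.

```python
def check_accesses(ops):
    """Return (cycle index, message) for each unsupported access.

    ops: list of ("R", addr) or ("W", addr), one entry per memory cycle.
    """
    problems = []
    for i in range(1, len(ops)):
        (prev_kind, prev_addr), (kind, addr) = ops[i - 1], ops[i]
        if prev_kind == "W" and kind == "W":
            problems.append((i, "consecutive writes"))
        if prev_kind == "W" and kind == "R" and addr == prev_addr:
            problems.append((i, "read of word written on previous cycle"))
    return problems


ok_sequence = [("R", 1), ("W", 2), ("R", 3)]
bad_sequence = [("W", 1), ("W", 2), ("R", 2)]
```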

4.5  SOS  technology,  PCB  technology 

(Chuck  Seitz) 

A  fairly  extensive  cell  library  for  CMOS/SOS,  all  in  Manhattan  geometry 
in  order  to  be  compatible  with  design  tools  that  do  not  handle  real 
geometry,  has  been  provided  to  MOSIS  for  distribution.  Circuits  received 
from  the  first  MOSIS  SOS  run  have  been  tested,  and  the  process  behaves 
exactly as expected. A new SOS technology writeup is in preparation for
MOSIS  distribution. 

We were happy to have assisted MOSIS in the development of PCB
services. Earl and our various plotting programs have been modified to
accommodate PCB technology, and we have made use of this service as
indicated in section 2.2 to package a prototype Mosaic machine. The PCBs
received very quickly from MOSIS were unremarkable except for a MOSIS logo
of excessive size.
ABSTRACT

The Mosaic element is a fast single chip computer designed to be used in
groups for concurrent computation experiments. Each element contains a
16-bit processor, read-write storage, read-only store for a small
initialization and bootstrap loading program, four input ports, and four
output ports. The Mosaic processor, a highly structured design that
achieves very good performance and density through innovations in its
microcode, circuit techniques, and layout, is described in detail.


INTRODUCTION 

Myriads of Mosaic elements can be connected together by their ports in a
variety of communication plans to form a family of specialized, high
performance, concurrent, and programmable computing engines. In addition
to its end use as a component for experiments with concurrent computing
engines, the Mosaic element has been an interesting vehicle for numerous
adventures in VLSI design, design tools, and testing. It includes
experiments and innovations in its microcode, circuit techniques, and
layout, with performance being a central objective throughout.

A Mosaic element with 4K bytes of read-write storage, approximately 140K
transistors on a chip 4000 lambda square (6 mm square at 3 micron feature
size), is sufficiently complex to have given our design tools a thorough
workout, and to have stretched our capabilities for laying out, verifying,
and testing large structured designs.

The original models for this project were (1) Sally Browning's research
on algorithms for a programmable tree machine13* and (2) the OM2
described in Mead & Conway4. Mosaic started out as a tree machine
element, but we have since come to see it as a building block for a
variety of fine grain ensemble machines* with connection plans up to
degree four, such as a tree, mesh, shuffle, chordal ring, or cube
connected cycle. The influence of the OM2 on the processor datapath
layout is apparent.

The research described in this paper was sponsored by the Defense
Advanced Research Projects Agency, ARPA Order number 3771, and monitored
by the Office of Naval Research under contract number N00014-79-C-0607.

Several early attempts to lay out a much less ambitious processor with a
4-bit path to off-chip storage managed to break our design tools, and
were thus indirectly the origin of the constraint solving composition and
geometry tool Earl* used for the present design. A new processor with a
16-bit path to storage that could be placed on-chip was designed in 1982,
sent to MOSIS in January 1983, and functioned essentially correctly, and
at 7 MHz (4 micron feature size), on first silicon in February 1983. The
processor design was subsequently augmented to include additional
functions, to speed up the control PLA, and to incorporate the planned
on-chip storage. It is this latest design that is described here.

TOP LEVEL VIEW

It appears that most of the silicon area in multiple instruction multiple
data (MIMD) ensemble machines will be devoted to storage. In Cosmic
Cube1, a larger grain size machine at Caltech whose organization is
otherwise similar to Mosaic, the fraction of the element complexity
devoted to storage is about 75%. With the precondition that a complete
Mosaic element fit on a single chip, and using today's MOSIS nMOS
fabrication with 1.5 micron lambda (3 micron feature size) on chips 6 mm
square, the complexity of today's Mosaic element is limited to 4000 by
4000 lambda, or 16 million square lambda (MSL). This area is apportioned
with about 2.5 MSL for the processor and ports, 1 MSL for the pad frame,
and 12 MSL (75%) for storage and its interconnect. The area allowed for
the processor is quite small, less than 6 sq mm, or 9,000 sq mils.
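
The area figures can be checked with a few lines of arithmetic. The
sketch below uses only numbers stated in the text (4000 lambda per side,
1.5 micron lambda, and the 2.5 / 1 / 12 MSL apportionment); the variable
and function names are our own.

```python
LAMBDA_UM = 1.5      # microns per lambda
SIDE_LAMBDA = 4000   # chip side in lambda

side_mm = SIDE_LAMBDA * LAMBDA_UM / 1000.0   # chip side in mm
total_msl = SIDE_LAMBDA ** 2 / 1e6           # million square lambda
budget_msl = {"processor and ports": 2.5, "pad frame": 1.0, "storage": 12.0}


def msl_to_sq_mm(msl, lambda_um=LAMBDA_UM):
    """Convert million square lambda to square millimetres."""
    sq_um = msl * 1e6 * lambda_um ** 2
    return sq_um / 1e6  # 1 sq mm = 1e6 sq um


def sq_mm_to_sq_mils(sq_mm):
    """Convert square millimetres to square mils (1 mil = 0.0254 mm)."""
    return sq_mm / (0.0254 ** 2)


processor_sq_mm = msl_to_sq_mm(budget_msl["processor and ports"])
processor_sq_mils = sq_mm_to_sq_mils(processor_sq_mm)
storage_fraction = budget_msl["storage"] / total_msl
```

With these inputs the chip side comes to 6 mm, the total to 16 MSL, the
storage share to 75%, and the processor area to a little under 6 sq mm
(under 9,000 sq mils).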


The storage is partitioned into several smaller arrays, as suggested by
the analysis presented in section 8.5 of Mead & Conway4. Each array is
4096 bits, 64 by 64, organized to interface with the processor as 256
16-bit words. The densest read-write storage we understand how to make
with MOSIS nMOS technology is based on a 3-transistor dynamic storage
cell, which requires that this storage be refreshed periodically. This
refresh function is accomplished in the microcode of the processor. The
very small amount of read-only storage required for the initialization
and bootstrap loader is implemented with a set of "maimed" RAM cells.

Thus the 16 MSL Mosaic has the floorplan shown in figure 1, but if more
MSL were made available by a reduction in lambda, one could use this area
to pack in more storage. While the processor is only 16% of the area of
the chip, it represents about 90% of the design effort, so most of the
following description concentrates on the processor.


Figure 1: Mosaic Element Floorplan

The Mosaic element is synchronous, with 2-phase non-overlapping clocks
supplied externally. The storage cycle, processor microcycle, and
datapath operations occur in parallel in one clock cycle.

PROCESSOR  ORGANIZATION 

Figure 2 is a detailed block diagram of the processor, while figure 3 is
the floorplan of the core of the processor, without the surrounding
storage and pads. The processor has two principal components: a
datapath/port block, and a controller. Each is a very dense, mostly
metal-limited block of layout. The datapath/port block is functionally
centered around the processor's single 16-bit internal data bus; it is
controlled by signals issued by the PLA-based controller.

DATAPATH 

The Mosaic datapath contains those parts of the processor that
communicate over the internal data bus. These parts include sixteen
general purpose registers, an ALU-shifter with associated flags, a memory
address section, an interrupt counter, four input ports, four output
ports, and an interface to the memory data bus and memory address bus.
The ports are discussed in the following section. The functional blocks
in the datapath are organized in a bit slice pattern, one bit of the bus
running through each bit slice, with a bit slice pitch of 34 lambda. In
the first clock phase of each cycle the bus is precharged and the
ALU-shifter computes a new result. The second (last) clock phase is used
for the bus transfer and the ALU carry chain precharge.

The ALU obtains operands from a pair of latches, called X and Y, that are
loaded from the bus. The ALU result serves as input to the shifter,
which uses pass gates to route correctly shifted data to the ALU-shifter
output. The ALU is logically very similar to the ALU in the OM design4,
with a pair of function blocks and a precharged pass transistor carry
chain. Although the ALU does not use carry lookahead, it is optimized to
the extent that it is not in the critical timing path. An associated
special purpose register, the Multiplier/Product, allows the processor to
perform a multiply step in one microcycle. The multiply macroinstruction
produces a 32-bit unsigned product in 20 microcycles.
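
One common way to realize a multiply step of this kind is shift-and-add
with a combined multiplier/product register. The sketch below assumes
that scheme and 16 steps for a 16-bit multiplier; the paper does not say
how the 20 microcycles divide between multiply steps and other work, and
the function name is our own.

```python
def multiply_16x16(multiplicand, multiplier):
    """Unsigned 16 x 16 -> 32 bit multiply by repeated multiply steps.

    The low half of mp starts as the multiplier; the high half
    accumulates the product. Each step examines the low bit, adds the
    multiplicand into the high half if it is set, then shifts right.
    """
    assert 0 <= multiplicand < 1 << 16
    assert 0 <= multiplier < 1 << 16
    mp = multiplier
    for _ in range(16):          # one multiply step per iteration
        if mp & 1:
            mp += multiplicand << 16
        mp >>= 1
    return mp


product = multiply_16x16(1234, 5678)
```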

The processor maintains four flags associated with the ALU/shifter.
These are the familiar Z (zero result), N (negative result), V (two's
complement overflow), and C (carry/not borrow). The controller cannot
sense the values of the flags directly. Instead, a fixed 3-bit field in
conditional branch macroinstructions specifies one of eight branch
conditions. These three bits, as well as the values of the four flags,
are inputs to a small PLA that produces one bit of output, the "flag
condition". This bit is an input to the controller, which tests it when
performing the conditional branch instructions. The branch condition
codes were assigned carefully so the flag condition PLA requires only
seven implicants. Since the PLA is so small, it fits neatly next to the
flags in the upper right corner of the processor, in a region formed by
removing the top four bit slices of the address section.
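
The function computed by the flag condition PLA can be sketched in
software. The assignment of the eight codes below follows our reading of
the flag condition table in figure 4 and should be treated as an
assumption, as should the function name.

```python
def flag_condition(code, Z, N, V, C):
    """Return the one-bit flag condition for a 3-bit branch code."""
    conditions = (
        V,               # 0: overflow
        N,               # 1: negative
        not C,           # 2: carry = 0
        N != V,          # 3: signed <
        Z,               # 4: zero
        Z or N,          # 5: <= zero
        Z or not C,      # 6: unsigned <=
        Z or (N != V),   # 7: signed <=
    )
    return bool(conditions[code & 7])


# Flags as they might stand after comparing equal operands.
taken = flag_condition(7, Z=True, N=False, V=False, C=True)
```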

On every microcycle the address section emits a new memory address onto
the 12 memory address wires that come out of the right edge of the
datapath. A 12-bit address is currently sufficient for the number of
words of memory we can place on chip. The address generation section
houses the program counter (PC) register, the current refresh address
(RA) register, and an incrementer used with both registers. The
controller guarantees that the RA is incremented and issued to the memory
frequently enough, at least once every 8 microcycles, to keep the dynamic
memory refreshed. Only memory cycles which would otherwise go to waste
are used for refresh cycles.
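
A rough bound on the refresh period follows from this guarantee. The
sketch below assumes that one refresh cycle refreshes one of the 64 lines
of every 64 by 64 array concurrently, so that 64 refresh cycles cover the
whole memory; the clock rates used are illustrative and the function name
is our own.

```python
def worst_case_refresh_period_us(lines=64, cycles_per_refresh=8, clock_mhz=10.0):
    """Longest time between refreshes of the same line, in microseconds."""
    return lines * cycles_per_refresh / clock_mhz


period_at_10_mhz = worst_case_refresh_period_us(clock_mhz=10.0)
period_at_15_mhz = worst_case_refresh_period_us(clock_mhz=15.0)
```

Under these assumptions the worst case is 512 microcycles, or about 51
microseconds at 10 MHz and about 34 microseconds at 15 MHz.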

The processor can generate timed interrupts using its interrupt counter
(IC). The IC is a 16-bit register which counts down once per microcycle
and causes an interrupt when it reaches zero. In order to guarantee
interrupt service in bounded time, the port-wait states must be
interruptible. Thus the side effects of any input or output instruction
that cannot be completed immediately are reversed, and the instruction is
refetched and restarted. Timed interrupts are useful for decoupling
communications from processing (e.g., to implement automatic message
routing, and to buffer large blocks of data) and to give the processor a
sense of time (e.g., for heuristic searches).

PORTS 

Mosaic  processors  communicate  with  each  other 
through  their  ports.  Each  processor  has  4  input  ports 
and  4  output  ports.  Connecting  an  output  port  of 
one  processor  to  an  input  port  of  another  processor 
forms  a  two-word  fifo.  Each  output  port  is  based 
on  a  parallel-in,  serial-out  shift  register;  each  input 
port  is  based  on  a  serial-in  parallel-out  shift  register. 
The  communication  between  input  and  output  ports 
is  bit  serial. 

Mosaic's implementation of the ports requires only a single wire, called
the port link, to connect an input to an output port. When a port is not
ready to perform a serial transfer, because it is an output port with no
data or an input port with unremoved data, it clamps the port link to
ground. On the microcycle when both ports are ready to perform a
transfer, neither port grounds the port link and it is pulled up to VDD
by an external pullup resistor. Both ports recognize this signal as the
"start bit" of a transfer, much as in RS-232 data communications. The
next 16 microcycles pass the data serially on the port link, and then the
ports revert to the clamp-if-not-ready state.

This protocol allows multiple input ports to be connected to the same
link: all input ports receive data from the output port beginning on the
cycle when all the ports are ready. We didn't notice this feature until
after we had running chips.
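
The clamp-if-not-ready protocol can be modelled in a few lines. This is
a behavioral sketch only: it assumes a 16-bit word sent one bit per
microcycle, least significant bit first (the bit order is not stated in
the text), and the function and parameter names are hypothetical.

```python
WIDTH = 16  # assumed word length in bits


def simulate_link(word, out_ready_at, in_ready_at):
    """Simulate one transfer on a shared port link.

    out_ready_at: microcycle at which the output port becomes ready.
    in_ready_at: list of microcycles at which each input port is ready.
    Returns (start bit cycle, list of words received by the input ports).
    """
    cycle = 0
    while True:
        ready = [cycle >= out_ready_at] + [cycle >= t for t in in_ready_at]
        if all(ready):   # nobody clamps the link: the pullup gives the start bit
            break
        cycle += 1
    start = cycle
    received = [0] * len(in_ready_at)
    for k in range(WIDTH):               # one data bit per microcycle
        bit = (word >> k) & 1
        received = [r | (bit << k) for r in received]
    return start, received


start_cycle, words = simulate_link(0xBEEF, out_ready_at=3, in_ready_at=[1, 5])
```

Because the link is high only when no port clamps it, the transfer
starts on the cycle at which the slowest port becomes ready, and every
input port on the link receives the same word.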

CONTROLLER 

On every microcycle the Mosaic controller issues a new set of signals to
control the datapath. The original plans for the controller assumed a
rather conventional organization in which microcode words were fetched
from a ROM, and a new ROM address was computed every microcycle by a
conglomeration containing an incrementer, multiplexers, and other
miscellaneous logic. This design was simplified when we realized that we
could efficiently program a PLA to perform most of the original
controller's function. An auxiliary PLA which controlled the ALU/shifter
proved to be very troublesome because we could not find a placement for
it that did not result in large wiring channels and expanses of white
space. We finally eliminated the auxiliary PLA by learning to program
the main controller PLA to perform its function. The controller became
merely a PLA with latches.

In  most  microcircuit  instruction  processors  the 
datapath  is  the  more  regular  part,  and  the  control 
the  less  regular  part.  In  the  Mosaic  processor,  the 
controller  is  even  more  regular  than  the  datapath. 

The controller has 20 inputs: 10 bits from the instruction register, the
flag condition, the port condition, the multiplier bit from the
multiplier/product register, the interrupt request from the interrupt
counter, the processor reset, and 5 feedback bits (outputs from the
controller clocked directly back to the controller input). The
instruction register (I) holds the current macroinstruction and can be
latched from the memory data bus on command from the controller. So
little feedback state is needed because much of the state is held in the
instruction register, and the sequences to implement macroinstructions
are short, typically 5 microcycles. (The shorter instructions are in
practice executed more frequently, so the average execution time is 4
microcycles.) Most of the 40 outputs from the controller go to clock-AND
bootstrap drivers that drive control lines into the datapath. These
outputs are effective during the microcycle following the
microinstruction fetch.

INSTRUCTION  SET 

The tables in figure 4 summarize the macroinstruction set. All
instructions are one word followed by zero, one, or two words of
immediate data. In the first instruction word, the two 4-bit fields J
and K can be used to specify one of the general registers. In some
instructions, the K field may specify one of the ports or a branch
condition instead.

There are two types of instructions: MOVEs and Arithmetics. MOVE
instructions fetch an operand specified by a 3-bit MSOURCE field, and
assign it as specified in the 4-bit MDEST field. The MOVEs incorporate
subroutine linkage and branches: specifying an MDEST of PC performs a
jump; an MDEST of
FPC -> @--R; X -> PC performs a subroutine call by pushing the current
PC on a stack, and then assigning it.

Arithmetic  instructions  fetch  two  operands,  X 
and  Y,  as  specified  by  the  3-bit  MODE  field.  (The  X 
and  Y  operands  in  fact  correspond  to  the  hardware 
registers  X  and  Y  at  the  input  to  the  ALU.)  Then 
they  perform  the  operation  specified  by  the  4-bit  OP 
field,  which  requires  computing  some  function  of  X 
and  Y,  and  usually  assigning  a  result  as  specified  by 
the  MODE. 

Instructions that write to an output port wait until there is room in the
fifo. Instructions that read from an input port wait until there is a
word to read, and can optionally "advance" the port (remove the word from
the fifo).

The  richness  of  this  instruction  set  is  justified 
by  the  code  compactness  it  offers  in  its  environment 
of  scarce  on-chip  memory. 

MICROCODE 

The speed, simplicity, and compactness of this design owe much to the
realization that the controller need be nothing more than a PLA with
latches. But a PLA is not merely sufficient; it is convenient and easy
to program for an instruction set such as Mosaic's in which
microinstruction sequences are short but heavily branched.


We chose to view each of the 120 implicants in the PLA as a word of
microcode. More than one word of microcode can be active (that is, more
than one implicant can be TRUE) in any given microcycle. Usually only
one word is active at a time, but there are important exceptions. In
these cases, the outputs are partitioned into disjoint sets, such that
each active word has no TRUE outputs (transistors in the OR plane)
outside its set. The effect of multiple active words used in this
restricted manner is like that of multiple disjoint PLAs, but the
physical layout retains the regularity of one PLA. In return for this
self-imposed restriction, the absolute true/complemented sense of the
individual outputs is irrelevant, the microcode assembler and assembly
language is simpler, and the microcode is easier to understand.
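
The view of implicants as microcode words can be sketched as follows.
The word contents, input names, and output names are invented for
illustration and do not come from the Mosaic microcode; the assertion
checks only that words active together assert disjoint outputs.

```python
def matches(implicant, inputs):
    """implicant: dict of input name -> required bit; others are don't-cares."""
    return all(inputs[name] == bit for name, bit in implicant.items())


def pla_outputs(words, inputs):
    """Union of the outputs asserted by every active microcode word.

    words: list of (implicant, set of asserted output names).
    """
    asserted = set()
    for implicant, outs in words:
        if matches(implicant, inputs):
            # Words that are active together must assert disjoint outputs.
            assert asserted.isdisjoint(outs), "overlapping active words"
            asserted |= outs
    return asserted


# Hypothetical microcode: the last two words are active simultaneously.
words = [
    ({"state": 1, "interrupt": 0}, {"load_I"}),
    ({"state": 2}, {"alu_add"}),
    ({"state": 2, "flag": 1}, {"write_reg"}),
]
signals = pla_outputs(words, {"state": 2, "flag": 1, "interrupt": 0})
```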

A simple microcode assembler, written in SIMULA, reads the source
microcode and assembles it into a runtime data structure. From here the
assembler can output the code in any of several formats, including Earl
source code. The assembler also contains an ad hoc register-transfer
level simulator of the entire processor. This simulator served as an
initial debugger for the processor design, and is still the initial
testing ground for modifications in the processor and its microcode.

To illustrate some features of the microcode programming style and
processor timing, the rest of this section is a blow-by-blow description
of the execution of a sample macroinstruction. Figure 5 shows the
assembly of the instruction ADD #7,R1,R2, the 4 microcode words required
to execute it, and the behavior of various parts of the processor in the
vicinity of these 3 cycles. This instruction adds immediate data 7 to
the contents of register 1, and stores the result in register 2. The
instruction executes in 3 microcycles, corresponding to the first,
second, and last two microcode words (the last two are active
simultaneously).

The tokens ".decode", ".get", and ".go" are mnemonics for feedback
states; they appear both in the input conditions and in the next state
outputs. The first microcode word, "DECODE:", is in fact the first word
of every instruction. It becomes active any time the feedback state is
".decode", and no interrupt is pending ("Interrupt=0"). Previous
microcode has ensured that a new macroinstruction was fetched on the
previous cycle. Thus "DECODE:" latches it into

Figure 4: Diagram of the Complete Instruction Set [panels: flag conditions; port fields; key; special cases (RESET, INTERRUPT); move instructions; arithmetic instructions]


A macroinstruction, assembly language:

    11:  ADD #7,R1,R2

A macroinstruction, binary code:

    10:  ...                     [last word of previous instruction]
    11:  1001 1001 0010 0001     [first word of instruction]
    12:  0000 0000 0000 0111     [immediate value = 7]
    13:  ...                     [first word of next instruction]
    14:  ...                     [immediate value for next instruction, or
                                  first word of instruction after next]

Source microcode for executing the macroinstruction:

The syntax for a source microcode word is:

    word <mnemonic>: <inputs> :: <outputs>

    word DECODE: .decode Interrupt=0 :: IN->I RA++->A RJ=> X Y M .get
    word #,J,K:  .get I= 1 0 0 1 :: saveC PC++->A IN=> X .go
    word ADD:    .go I= 1 * * * 1 0 0 0 :: ALUONLY GP=86 Cin=0 NOshift setZNV setC
    word ALU->K: .go I= 1 0 * * :: NOALU PC++->A W=> RK .decode

Processor timing in executing the macroinstruction:

Figure 5: Example of a macroinstruction execution


the instruction register ("IN->I") at the start of the
cycle. The controller has not had time to branch
based on the new instruction, but the J and K fields
will have arrived at the register decoder in time to
select a register to drive the bus on that cycle; thus
this microcode word fetches one of the registers to all
of the destinations where it might be needed ("RJ=>
X Y M"). This "register prefetch" saves a cycle in
most instructions. It is too early to know what to do
with the next memory cycle, so the microcode uses it
as a refresh cycle ("RA++->A", a macro for "RA->INC
Add1 INC->A A->RA").

The next microcode word, "#,J,K:", is conditional
on the first bit (1) and MODE bits (0 0 1)
of the instruction register ("I= 1 0 0 1") and
corresponds to an arithmetic instruction with an immediate
value and a register as operands. In the complete
microcode there is also a microcode sequence
conditional upon each of the other possible MODE
fields. The MODE in this example specifies operand
X as an immediate value, which is obtained from the
memory data bus via the memory data input buffer
("IN=> X"). The PC is then incremented past the
immediate value ("PC++->A") in order to begin
fetching the next instruction. The next state ".go"
indicates that all operands have been fetched and the
code for the operative part of the instruction should
take over.

The last two microcode words, "ADD:" and "ALU->K:",
are active simultaneously and complete the
macroinstruction. In the "ADD:" word the token
"ALUONLY" indicates that this word specifies only
ALU outputs (i.e. it has no transistors in the OR
plane for other outputs), while "NOALU" in the
"ALU->K:" word indicates that this word controls the
rest of the outputs. The "ADD:" word instructs the ALU
to add its inputs, X and Y, by specifying the appropriate
Generate and Propagate codes ("GP=86"), the carry-in
("Cin=0"), and the type of shift ("NOshift").
The complete microcode contains similar words corresponding
to the other arithmetic operations: subtract,
increment, etc. They are independent of the
MODE field of the instruction but dependent on the
OP field ("I= 1***1000", since the OP code for ADD
is 1000).

The "ALU->K:" word deposits the ALU output
in register K ("W=> RK"). Other words, dependent
on the MODE but independent of the OP
code, handle the other possible destinations. Thus
the orthogonality in the macroinstruction set, arithmetic
OPs versus MODEs, is represented directly in
the microcode. Note that only one microcode word,
"ALU->K:", is needed to handle four MODE cases,
since the MODEs have been carefully encoded so that
one input condition ("I= 10**") decodes all four
cases. Careful encoding such as this throughout the
instruction set has led to more compact microcode.
In "ALU->K:" the PC is incremented and used as
the memory address ("PC++->A"), as it is in the
last cycle of all instructions. This begins fetching the
word after the next instruction, in case the next instruction
takes an immediate value and needs to use
it in its second cycle.
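
The way several microcode words can be active at once may be easier to see in a small sketch. The Python below is only an illustration under stated assumptions, not the SIMULA assembler or its simulator: the table layout, the function names, and the two placeholder register fields are hypothetical, and only the feedback states and input patterns are taken from the four words discussed above.

```python
# Hypothetical sketch: a microcode word is active when the feedback state
# and a pattern over the leading instruction-register bits both match.
# '*' is a don't-care bit; None means the instruction is not examined.
WORDS = [
    # (mnemonic, feedback state, instruction pattern, next state or None)
    ("DECODE", ".decode", None, ".get"),
    ("#,J,K", ".get", "1001", ".go"),
    ("ADD", ".go", "1***1000", None),      # ALUONLY: ALU outputs only
    ("ALU->K", ".go", "10**", ".decode"),  # NOALU: all remaining outputs
]


def matches(pattern, bits):
    """True if every non-'*' pattern bit equals the corresponding leading bit."""
    return pattern is None or all(p in ("*", b) for p, b in zip(pattern, bits))


def active_words(state, instruction, interrupt=0):
    """Return the mnemonics of all words active in this microcycle."""
    active = []
    for name, in_state, pattern, _next in WORDS:
        if in_state != state or not matches(pattern, instruction):
            continue
        if name == "DECODE" and interrupt != 0:
            continue  # DECODE also requires Interrupt=0
        active.append(name)
    return active


def run(instruction):
    """Step through the microcycles of one macroinstruction."""
    state, trace = ".decode", []
    while True:
        names = active_words(state, instruction)
        trace.append(names)
        next_states = [n for name, _s, _p, n in WORDS if name in names and n]
        state = next_states[0]
        if state == ".decode":
            return trace


# Assumed encoding: first bit 1, MODE 001, OP 1000, then two placeholder
# 4-bit register fields (their order and values are illustrative only).
ADD_IMMEDIATE = "1" + "001" + "1000" + "0001" + "0010"
trace = run(ADD_IMMEDIATE)
```

Under these assumptions the trace has three microcycles and four words, with "ADD" and "ALU->K" sharing the last cycle because their input conditions examine different instruction bits.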

STORAGE 

Although specialized semiconductor processes provide
higher storage density than those suitable for
the Mosaic processor, a processor with on-chip memory
has many advantages over processor and memory
in separate packages. These advantages include reduced
volume, pin count, signal energy, and driver
delay, resulting primarily from the integration of the
memory bus into a single package.

For each storage bank, a two-bus three-transistor
dynamic RAM memory cell is organized in 64x64 bits
with a 16-bit word interface. All the banks operate
in parallel to accomplish parallel refresh, and provide
a read and pipelined write operation every processor
microcycle (roughly 300 tau). Each memory access
starts with a word-line access followed almost always
by a refresh write to the same word-line on the next
cycle. Because this write occurs in the same cycle as
the next read, the storage control is somewhat subtle.

Mandating  a  write  causes  one  of  the  4  words 
read  from  the  selected  memory  bank  to  be  replaced 
with  write  data  from  the  processor.  This  write  data 
is  written  in  the  next  cycle,  in  parallel  with  the  next 
read.  However,  if  the  read  is  to  the  same  word  line 
as  the  pipelined  write,  stale  data  is  accessed,  and 
the  subsequent  write  back  to  the  word  line  would 
permanently  store  incorrect  data.  For  this  reason  a 
write  back  is  disabled  on  the  second  cycle  after  a 
write  cycle.  This  form  of  pipelining  imposes  domain 
restrictions  upon  the  microcode  in  that  consecutive 
writes,  refresh  following  write,  and  write  followed 
by  read  to  the  same  address  will  all  fail.  The  first 
two  conditions  do  not  occur  in  the  microcode,  and 
the  third  condition  occurs  only  by  writing  into  the 
instruction  stream. 
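
The three domain restrictions can be stated as a simple check over a sequence of memory operations. The following Python is a minimal sketch under stated assumptions, not a tool from the project: the function name, the tuple representation, and the one-operation-per-microcycle model are all hypothetical.

```python
def pipeline_violations(ops):
    """Return (index, reason) for each operation that breaks a restriction.

    ops is a list of (kind, address) tuples, one per microcycle, where kind
    is 'read', 'write', or 'refresh' (an assumed representation).
    """
    problems = []
    for i in range(1, len(ops)):
        prev_kind, prev_addr = ops[i - 1]
        kind, addr = ops[i]
        if prev_kind != "write":
            continue  # only the cycle after a write is restricted
        if kind == "write":
            problems.append((i, "consecutive writes"))
        elif kind == "refresh":
            problems.append((i, "refresh following write"))
        elif kind == "read" and addr == prev_addr:
            problems.append((i, "read of address just written"))
    return problems


ok = pipeline_violations([("write", 5), ("read", 6), ("write", 7)])
bad = pipeline_violations([("write", 5), ("read", 5), ("write", 9), ("write", 9)])
```

In this sketch a write followed by a read of a different address is accepted, which matches the statement that only a read of the address just written fails.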


CIRCUIT  DESIGN 


Some of the performance and layout simplicity
of Mosaic is due to the simple clock-AND bootstrap
driver shown in Figure 6. It is used extensively and in
several variations both in the processor and storage
sections. In the processor, this clock-AND is used
to produce control signals that are the logical-and
of a PLA output and a clock. In the storage, the
clock-AND is used so extensively in driving select and
data lines that depletion transistors are completely
absent.


Figure  6:  Clock-AND  Circuit 


Although referred to as a "driver," this clock-AND
does not provide power amplification of the
clock, but rather passes a replica of the "hot clock"
input, whatever its HIGH voltage, to the output as
gated by an enable signal of low energy. The clock
signal typically switches between ground and 7 volts
with VDD = 5 volts, but the chips also work correctly
at reduced speed with 5 volt clocks. The delay
and power dissipation of these clock-ANDs are almost
negligible, and so the clock driving problem, together
with the power dissipation usually required in control
signal drivers, is exported to outside the chip where
it can be dealt with using special driver circuits. This
hot clock technique improves performance in pass
structures, and also makes the performance much
less sensitive to variations in the depletion threshold
voltage than in conventional Mead-Conway designs.
Precharge techniques are also used extensively in this
chip, both to save power and for speed.

DESIGN  TOOLS 

The layout and verification were accomplished
on a VAX-11/780 running Berkeley Unix with design
tools written in MAINSAIL and C. Circuit design and
optimization relied primarily on tau model calculations.
SPICE was used to evaluate bootstrap effects,
technology dependence, and critical timing paths.

The processor design is represented by 10,000
lines of code, interpreted by Earl [6], a constraint solving
composition and geometry tool. Although the
parts are composed in a rectangular bounding box
discipline, the geometry internal to cells includes arbitrary
angles and approximations of circular arcs, a
form of "Boston geometry" that can be specified very
easily in Earl. This unusual layout style is estimated
to have reduced silicon area by 10% over 45-degree
angle geometry, and by about 25% over Manhattan
geometry.

For the design verification, the entire logic design
was coded and simulated using the ternary switch
level simulator MOSSIM [8] to verify logical correctness.
After the layout was complete, raster extraction
of layout using a Boston geometry circuit extractor
produced a switch network that was used for
MOSSIM II [9] simulations.

TESTING 

First  silicon  for  the  Mosaic  processor,  received 
on  9  February  1983,  34  days  after  the  CIF  was  sub¬ 
mitted  to  MOSIS,  was  tested  immediately  and  found 
to  run  code  at  a  7  MHz  clock  rate  at  room  tempera¬ 
ture.  Subsequent  processors  fabricated  using  a  faster 
process  (still  with  a  4  micron  feature  size)  ran  at  up 
to  11  MHz  at  room  temperature. 

Initial testing was accomplished by running the
same code that had been used for switch level simulations.
Subsequent testing using more extensive test
programs  discovered  minor  bugs  that  have  been  fixed 
in  subsequent  microcode.  A  scan  path  included  in  the 
original  design  between  the  datapath  and  controller 
was  not  used,  although  it  might  have  been  useful  if 
anything  had  been  seriously  wrong. 

Overall, our testing experiences have been quite
similar to those reported by several other university
groups, and point to two interesting developments
in testing for design verification. First, verification
tools  have  advanced  to  the  extent  that  nearly  the 
entire  design  verification  task  is  now  accomplished 
before  first  silicon.  Second,  chips  that  are  systems 
rather  than  components  turn  out  to  be  simpler  to 
test  by  placing  them  in  their  system  environment 
than  in  a  conventional  tester,  and  the  same  tools  that 
are  used  to  program  these  systems  serve  to  develop 
thorough  tests  of  their  function. 

ACKNOWLEDGEMENTS 

Chris Kingsley - Earl, Mike Schuster - Fsim,
Howard Derby - early design, OM2 & GMP - ideas.


REFERENCES 

[1] Sally A Browning, Computations on a Tree of
Processors, Proceedings of the Caltech Conference
on VLSI, January 1979, Computer Science,
Caltech.

[2] Sally A Browning, Hierarchically Organized Machines,
Section 8.4 in Mead & Conway [4].

[3] Sally A Browning and Charles L Seitz, Communication
in a Tree Machine, Proceedings of the
Second Caltech Conference on VLSI, January
1981, Computer Science, Caltech.

[4] Carver Mead and Lynn Conway, Introduction to
VLSI Systems, Addison-Wesley, 1980.

[5] Charles L Seitz, Ensemble Architectures for VLSI,
Proceedings of the MIT Conference on Advanced
Research in VLSI, January 1982, Artech Books,
1982.

[6] Chris Kingsley, Earl: An Integrated Circuit Design
Language, Technical Report 5021, Computer
Science, Caltech, June 1982.

[7] Charles L Seitz, Experiments with VLSI Ensemble
Machines, Technical Report 5102, Computer
Science, Caltech, October 1983.

[8] Randal E Bryant, A Switch-Level Model and Simulator
for MOS Digital Systems, Technical Report
5065, Computer Science, Caltech, January
1983.

[9] R. Bryant, M. Schuster, D. Whiting, MOSSIM
II: A Switch-Level Simulator for MOS LSI, User's
Manual, Technical Report 5033, Computer Science,
Caltech, March 1982.


Prototype  Mosaic  Processor 


CONCURRENT  FAULT  SIMULATION 
OF  MOS  DIGITAL  CIRCUITS 

Michael D. Schuster and Randal E. Bryant


California Institute of Technology
Pasadena, California 91125


5101:TM:83


To be presented at the Conference on Advanced Research in VLSI, to be held at the Massachusetts
Institute of Technology, January 1984. Proceedings published by Artech House, Inc., Dedham, MA
02028.


ABSTRACT 

The concurrent fault simulation technique is widely used to analyze the behavior of digital circuits
in the presence of faults. We show how this technique can be applied to metal-oxide-semiconductor
(MOS) digital circuits when modeled at the switch-level as a set of charge storage nodes connected by
bidirectional transistor switches. The algorithm we present is capable of analyzing the behavior of a wide
variety of MOS circuit failures, such as stuck-at-zero or stuck-at-one nodes, stuck-open or stuck-closed
transistors, or resistive opens or shorts. We have implemented a fault simulator FMOSSIM based on
this algorithm. The capabilities and the performance of this program demonstrate the advantages of
combining switch-level and concurrent simulation techniques.


This research was supported in part by the IBM Corporation and by the Defense Advanced Research
Projects Agency, ARPA Order 3771. Michael Schuster was supported in part by a Bell Laboratories
Ph.D. Scholarship.


© Artech House, 1984


CONCURRENT  FAULT  SIMULATION  OF  MOS  DIGITAL  CIRCUITS 


Michael  D.  Schuster  and  Randal  E.  Bryant 


Department  of  Computer  Science 
California  Institute  of  Technology 
Pasadena,  California  91125 



INTRODUCTION 

Test  engineers  use  fault  simulators  to  deter¬ 
mine  how  well  a  sequence  of  test  patterns,  when 
applied  to  the  inputs  of  an  integrated  circuit,  can 
distinguish  a  good  chip  from  a  defective  one.  The 
fault  simulator  is  given  a  description  of  the  good 
circuit,  a  set  of  hypothetical  faults  in  the  circuit, 
a  specification  of  the  observation  points  of  the 
test (e.g. the output pins of the chip), and a sequence
of test patterns. It then simulates how the
good circuit and all of the faulty circuits would
behave when the test patterns are applied to the
inputs. A fault is considered detected if at any
time the simulation of that particular faulty circuit
produces, at some observation point, a logic
value different than that produced by the good circuit.
By keeping track of which faults have been
detected and which have not, the fault simulator
can  determine  the  fault  coverage  of  the  test  se¬ 
quence,  which  is  defined  as  the  ratio  of  the  number 
of  faults  detected  to  the  total  number  simulated. 
The  simulator  can  also  provide  the  user  with  infor¬ 
mation  about  which  faults  have  not  been  detected, 
either  because  the  test  sequence  failed  to  exercise 
the  defective  part  of  the  circuit,  or  because  the  se¬ 
quence  failed  to  make  the  effect  of  such  an  exercise 
visible  at  some  observation  point.  This  informa¬ 
tion  guides  the  engineer  in  extending  or  modify¬ 
ing  the  test  sequence  to  improve  its  fault  coverage. 
Such  a  tool  is  invaluable  for  developing  test  pat¬ 
terns  for  today’s  complex  digital  systems. 
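
The definition of fault coverage can be made concrete with a toy example. The Python below is a minimal serial sketch under stated assumptions, not FMOSSIM: the three-input circuit, the node names, and the stuck-at fault list are invented for illustration, and the circuit is described functionally rather than at the switch level.

```python
def circuit(a, b, c, fault=None):
    """Toy circuit out = (a AND b) OR c, with an optional stuck-at fault.

    fault is (node_name, stuck_value), or None for the good circuit.
    """
    def val(name, computed):
        if fault is not None and fault[0] == name:
            return fault[1]  # the faulty node ignores its computed value
        return computed

    a, b, c = val("a", a), val("b", b), val("c", c)
    n1 = val("n1", a & b)
    return val("out", n1 | c)


def fault_coverage(faults, patterns):
    """Apply patterns in order; return (coverage, undetected faults)."""
    remaining = list(faults)
    detected = []
    for pattern in patterns:
        good = circuit(*pattern)
        still_hidden = []
        for fault in remaining:
            if circuit(*pattern, fault=fault) != good:
                detected.append(fault)  # visible at the output: detected
            else:
                still_hidden.append(fault)
        remaining = still_hidden
    return len(detected) / len(faults), remaining


FAULTS = [(node, v) for node in ("a", "b", "c", "n1", "out") for v in (0, 1)]
coverage, undetected = fault_coverage(FAULTS, [(1, 1, 0), (0, 0, 0)])
```

With these two patterns the sketch detects 7 of the 10 hypothetical faults; the undetected list is the kind of information that would guide adding a pattern such as (0, 0, 1).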

For a large integrated circuit such as a microprocessor
chip, thousands of faults must be simulated
to adequately characterize the fault coverage
of a test sequence. Furthermore, test sequences
can involve thousands of patterns. Hence a simple
serial simulation, in which the good circuit and
each faulty circuit are simulated separately, would
require far too much computation. Fortunately,
clever algorithms can reduce the amount of computation
considerably. A technique known as concurrent
simulation [1] exploits the fact that a faulty
circuit typically differs only slightly from the good
circuit. Rather than simulating each circuit separately,
only the good circuit is simulated in its
entirety. The simulator keeps track of how the
network state of each faulty circuit differs from
the network state of the good circuit by selectively
simulating portions of the faulty network.
To the user, it appears as if the program is simulating
many circuits concurrently, but the amount of
CPU time required is a small factor (e.g. often
less than 10 times) greater than the time required
to simulate the good circuit alone. Furthermore,
the simulator can easily determine when a faulty
circuit produces a value different than the good
circuit at some observation point without storing
the entire output history of the good circuit
simulation. Once a fault has been detected, the
simulation of this particular faulty circuit can be
dropped, thereby reducing the amount of computation
required for the remainder of the simulation.
Typically, the faults that cause great differences
from the behavior of the good circuit, and
hence require the most computational effort, are
detected quickly. Consequently, fault dropping
greatly improves the overall performance of the
simulator.
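
The idea of storing only how each faulty circuit differs from the good circuit can be pictured as follows. This Python is a minimal sketch under stated assumptions and does not describe FMOSSIM's actual data structures: the class, its methods, and the dictionary representation are hypothetical, and updates to the good circuit's state are left out for brevity.

```python
class ConcurrentState:
    """Good-circuit node states plus per-fault records of divergent nodes."""

    def __init__(self, good):
        self.good = dict(good)   # node name -> state of the good circuit
        self.diffs = {}          # fault id -> {node: state differing from good}

    def value(self, fault, node):
        """State of a node in a faulty circuit, defaulting to the good state."""
        return self.diffs.get(fault, {}).get(node, self.good[node])

    def set_faulty(self, fault, node, state):
        """Record a faulty-circuit state only while it differs from the good one."""
        record = self.diffs.setdefault(fault, {})
        if state == self.good[node]:
            record.pop(node, None)   # the faulty circuit has converged here
        else:
            record[node] = state

    def drop(self, fault):
        """Fault dropping: stop tracking a fault once it has been detected."""
        self.diffs.pop(fault, None)


s = ConcurrentState({"a": "1", "b": "0"})
s.set_faulty("f1", "b", "1")
```

The storage for each fault grows only with the number of nodes on which it disagrees with the good circuit, which is the property the paragraph above relies on.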

Most existing logic simulators model a digital
circuit as a network of logic gates, in which
each gate produces values on its outputs based
on the values applied to its inputs, and possibly
on the value of its internal state. Some of these
simulators extend the simple Boolean gate model,
in which only the value 0 or 1 is permitted on
each input and output, with additional logic values
and special types of gates to model circuit structures
such as busses and pass transistors. These
simulators are not suitable for modeling faults in
MOS digital circuits for two reasons. First, many
MOS circuit structures cannot be adequately modeled
as a set of logic gates. Creating gate-level
descriptions of pass transistor networks, dynamic
memory elements, and precharged logic is at best
tedious and inaccurate, and at worst impossible,
even with extended gate models. The user must
translate the logic design by hand into a form
compatible with the simulator, and the resulting
simulation is inherently biased toward the user's
understanding of the functionality of the circuit.
Second, logic gate simulators are especially poor
at predicting the behavior of a MOS circuit in
the presence of faults. Even simple logic gates
can become seemingly complex sequential circuits
when a fault such as an open-circuited transistor
occurs. As a result, fault simulators based
on logic gates can model only a limited class of
faults, such as the gate outputs and inputs stuck-at-zero
or stuck-at-one. Faults such as short circuits
across transistors and between wires, or open
circuits in transistors or wires, are beyond their
capability. Furthermore, even the modeling of
stuck-at faults is limited in accuracy when the
logic gate description is an artificial translation of
the actual circuit structure.

To remedy these problems with logic gate simulators,
we propose that fault simulations of MOS
circuits be performed at the switch level with the
transistor structure of the circuit represented explicitly,
but with each transistor modeled in a
highly idealized way. This approach has proved
successful for logic simulation in programs such
as MOSSIM and MOSSIM II, because properties
such as the bidirectional nature of field-effect
transistors and the charge storage capabilities of
the nodes in a MOS circuit are modeled directly,
rather than by some artificial translation into logic
gates. Unlike the precise but time-consuming algorithms
used by circuit simulators, switch-level
simulators model the circuit in a sufficiently simplified
way that they operate at speeds comparable
with conventional logic gate simulators. Furthermore,
our switch-level logic model is well suited for
modeling a variety of failures in MOS circuits in a
reasonably realistic way, because many faults can
be viewed as creating new switch-level networks
which differ from the switch-level representation
of the good circuit. Hence, while the switch-level
model has proved successful for logic simulation,
it seems especially attractive for fault simulation.
Hayes [6] has proposed the Connector-Switch-Attenuator
representation of logic circuits for modeling
faults, and our switch-level model has essentially
the same capabilities.

We  have  adapted  the  technique  of  concur¬ 
rent  simulation  to  implement  a  fault  simulator  for 
MOS  circuits,  where  the  problem  is  viewed  as  one 
of  simulating  a  large  number  of  nearly  identical 
switch-level  networks.  This  program  FMOSSIM 
can  simulate  a  large  variety  of  MOS  circuits,  un¬ 
der  a  variety  of  fault  conditions,  at  much  higher 
speeds than would be possible with serial simulation.
Other concurrent fault simulators for MOS
have been implemented, but these could only model
a very limited class of networks. In this paper,
we  will  present  an  overview  of  the  switch-level 
model  and  how  different  faults  can  be  represented 
in  it.  We  also  discuss  our  concurrent,  switch-level 
simulation  algorithm  and  present  performance  re¬ 
sults  from  FMOSSIM. 


The following network model is implemented
in the simulators MOSSIM II and FMOSSIM. It
provides a more general transistor model than provided
by other switch-level simulators, giving better
capabilities for fault injection. A switch-level
network consists of a set of nodes connected by a
set of transistors. Each node has a state 0, 1, or
X, where 0 and 1 represent low and high voltages,
respectively. The X state represents an indeterminate
voltage arising from an uninitialized node,
from a short circuit, or from improper charge sharing.
No restrictions are placed on how transistors
are interconnected.

Each node is classified as either an input node
or a storage node. An input node provides a strong
signal  to  the  network,  as  does  a  voltage  source  in 
an  electrical  circuit.  Its  state  is  not  affected  by 
the  actions  of  the  network.  Examples  include  the 
power  and  ground  nodes  Vdd  and  Gnd,  which  act 
as  constant  sources  of  the  states  1  and  0,  respec¬ 
tively,  as  well  as  any  clock  or  data  inputs. 

The state of a storage node is determined
by the operation of the network. Much like a
capacitor in an electrical circuit, a storage node
holds its state in the absence of connections to input
nodes. To provide a simple model of charge
sharing, each storage node is assigned a discrete
size from the set $\{\kappa_1, \kappa_2, \ldots, \kappa_q\}$, where the sizes
are ordered $\kappa_1 < \kappa_2 < \cdots < \kappa_q$. A larger
storage node is assumed to have much greater capacitance
than a smaller one. When a set of storage
nodes charge share, the states at the largest nodes
in the set override the states of the smaller nodes.
The number of different sizes required ($q$) depends
on the circuit to be simulated. Most circuits can be
represented with just two node sizes. In this representation,
high capacitance nodes such as busses
are assigned size $\kappa_2$, and all other nodes are assigned
size $\kappa_1$.
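
The charge-sharing rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the simulator's code: the function name and the pair representation are hypothetical, sizes are written as integers 1..q standing for the ordered sizes, and the choice of X when the largest nodes disagree is an assumption based on the description of X as arising from improper charge sharing.

```python
def charge_share(nodes):
    """Resolve the common state of a set of connected storage nodes.

    nodes is a list of (size, state) pairs, with integer size and state
    '0', '1', or 'X'. Only the largest nodes contribute; if they disagree,
    or any of them is X, the result is X.
    """
    largest = max(size for size, _ in nodes)
    states = {state for size, state in nodes if size == largest}
    return states.pop() if len(states) == 1 else "X"


bus_wins = charge_share([(2, "1"), (1, "0")])   # size-2 bus overrides a size-1 node
conflict = charge_share([(1, "0"), (1, "1")])   # equal sizes holding opposite states
```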

A transistor is a device with terminals labeled
gate, source, and drain. No distinction is made
between the source and drain connections; each
transistor is symmetric and bidirectional. Because
transistors can be either n-type, p-type, or d-type,
both nMOS and CMOS circuits can be modeled. A
d-type transistor corresponds to a negative threshold
depletion mode device. A transistor acts as
a resistive switch connecting or disconnecting its
source and drain nodes according to its type and
the state of its gate node, as shown in Figure
1. Transistor states 0 and 1 represent open (non-conducting)
and closed (fully conducting) conditions,
respectively. The X state represents an indeterminate
condition between open and closed,
inclusive.

Figure 1. Transistor state function
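
Since the table in Figure 1 is not reproduced in this text, the following Python sketch states the transistor state function as it is commonly given for this switch-level model. Treat it as an assumption rather than a transcription of the figure: an n-type transistor follows its gate, a p-type transistor inverts it, a d-type transistor always conducts, and an X gate yields an X transistor state for n-type and p-type devices.

```python
def transistor_state(kind, gate):
    """Map transistor type ('n', 'p', 'd') and gate state to a transistor state.

    Returns '0' (open), '1' (closed), or 'X' (indeterminate).
    """
    if kind not in ("n", "p", "d") or gate not in ("0", "1", "X"):
        raise ValueError("unknown transistor type or gate state")
    if kind == "d":
        return "1"          # depletion device conducts for any gate state
    if gate == "X":
        return "X"          # indeterminate gate gives indeterminate switch
    if kind == "n":
        return gate         # n-type conducts when its gate is 1
    return "1" if gate == "0" else "0"   # p-type conducts when its gate is 0


table = {(k, g): transistor_state(k, g) for k in "npd" for g in "01X"}
```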

To model the behavior of ratioed circuits, each transistor
is assigned a discrete strength from the set
$\{\gamma_1, \gamma_2, \ldots, \gamma_p\}$, where strengths
are ordered $\gamma_1 < \gamma_2 < \cdots < \gamma_p$. A
stronger transistor is assumed to have much greater
conductance than a weaker one. When a storage node is
connected to a set of input nodes by paths of conducting
transistors, its resulting state depends only on the states
of the input nodes connected by paths of greatest strength.
The strength of a path is defined to equal the strength of
the weakest transistor in the path. The total number of
strengths required ($p$) depends on the circuit to be
modeled. Most CMOS circuits do not utilize ratioed logic and
hence can be modeled with just one transistor strength. Most
nMOS circuits require only two strengths, with pull-up loads
assigned strength $\gamma_1$ and all other transistors
assigned strength $\gamma_2$.
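
The path strength rule can be illustrated with a short
sketch. It assumes every transistor on each path is fully
conducting, encodes strengths as integers (1 for
$\gamma_1$, 2 for $\gamma_2$), and ignores stored charge;
the function names are hypothetical.

```python
def path_strength(strengths):
    """A path is as strong as its weakest transistor."""
    return min(strengths)


def resolve(paths):
    """Resolve a storage node driven through several conducting paths.

    paths is a list of (input_state, [transistor strengths on the path]).
    Only the strongest paths matter; if they disagree the result is 'X'.
    """
    best = max(path_strength(s) for _, s in paths)
    winners = {state for state, s in paths if path_strength(s) == best}
    return winners.pop() if len(winners) == 1 else "X"


# Ratioed nMOS inverter with its input high: a strength-1 pull-up to Vdd
# competes with a strength-2 pull-down to Gnd, and the pull-down wins.
inverter_output = resolve([("1", [1]), ("0", [2])])
```

Two paths of equal strength carrying different states give
X, which is how a short between Vdd and Gnd shows up in this
model.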


Figure 2. Three transistor dynamic RAM

As an example of a switch-level network, consider the three
transistor dynamic RAM circuit shown in Figure 2. The bus
node has size $\kappa_2$ to indicate that it can supply its
state to the size $\kappa_1$ storage node ($m_1$ or $m_2$)
of the selected memory element during a write operation
(when $w_1$ or $w_2$ is 1) and to the size $\kappa_1$ drain
node ($c_1$ or $c_2$) of the selected storage transistor
during a read operation (when $r_1$ or $r_2$ is 1). The
d-type pull-up transistor in the input inverter has strength
$\gamma_1$, to indicate that it can drive the bus high only
when the strength $\gamma_2$ pull-down transistor is not
conducting. The strengths of all other transistors in the
circuit are arbitrary, since they are not involved in
ratioed path formation (except possibly when faults are
present).

The switch-level network model strikes a reasonable balance
between a detailed electrical model and an abstract logical
model. As a result of this abstraction, the model may not
predict the true behavior of circuits such as sense
amplifiers and arbiters which rely on detailed analog
properties. Moreover, the network model does not contain
enough detail to accurately model timing behavior, because
even in circuits with straightforward logical behavior,
timing can be subtle. However, experience has shown that
switch-level simulation works quite well for verifying
logic designs.


Faults are represented in FMOSSIM as though extra fault
transistors were added to the network, much like the method
proposed by Lightner and Hachtel. In the implementation,
however, many of these faults are injected without actually
adding fault transistors; nevertheless, the behavior is
equivalent to what is described below. The gate nodes of
the fault transistors are considered to be extra fault
inputs to the network that control the presence or absence
of the failures. A variety of MOS failures can be modeled
with this method. For example, a short circuit between two
nodes is modeled by connecting the nodes with a fault
transistor that is open in the good circuit and closed in
the faulty circuit. Similarly, an open circuit is modeled
by splitting a node into two parts and connecting the
resulting nodes with a fault transistor that is closed in
the good circuit and open in the faulty circuit. By
adjusting the strength of the fault transistor, the
resistance of the short or open may be modeled in an
approximate way. For example, if the strength of the fault
transistor is set to $\gamma_{p+1}$ (i.e. a strength
greater than that of any normal transistor), then setting
this transistor state to 1 shorts the source and drain
nodes together such that they act as a single node.
Moreover, because the state of each fault transistor can be
controlled independently, both single and multiple faults
can be injected.

Figure 3 illustrates the use of fault transistors to create
a variety of circuit faults. Those transistors with gate
nodes labeled $f$ are normally 0, but are set to 1 to
create the fault; the transistors with gate nodes labeled
$\bar{f}$ are normally 1, but are set to 0 to create the
fault. A stuck-at-zero or stuck-at-one node fault can be
modeled by inserting a strength $\gamma_{p+1}$ fault
transistor to short the node to Gnd or to Vdd,
respectively. A stuck-closed transistor fault is injected
by shorting the transistor's source and drain together with
a fault transistor whose strength equals that of the
failing transistor. Similarly, a stuck-open transistor
fault is modeled by putting a fault transistor in series
with it. In FMOSSIM, both stuck-at node states and stuck-at
transistor states are implemented without extra fault
transistors, while other faults require additional
transistors to be inserted into the network.

Figure 3. Modeling MOS failures


Although FMOSSIM can model a larger class of faults than
can be modeled by logic gate fault simulators, it still
provides only a simplified representation of the faulty
circuit. For example, the effects of manufacturing defects
such as incorrect transistor thresholds, pinholes in the
gate oxides, and variations in the circuit delays cannot be
described accurately. The effects of resistive shorts and
opens can only be approximated. In fact, even existing
circuit simulators cannot model defects that change the
basic nature of the devices, such as pinholes in the gate
oxides. However, even if the fault models supported by our
simulator do not exactly match the failure modes in actual
chips, the program can still help the designer in
developing a set of test patterns. For circuits implemented
in bipolar technologies such as TTL, experience has shown
that a test sequence that yields a high level of coverage
for single stuck-at-zero and stuck-at-one faults in the
logic gate network generally provides a good test of the
circuit. It seems reasonable to expect that the test
coverage measured by a switch-level fault simulator for an
idealized set of faults should reliably predict how well
the test sequence will work on a MOS circuit. Such a
conjecture, however, can only be confirmed by actual
experience in a manufacturing environment.

Many faults in our model have the effect of creating an X
state on a node when the good circuit has a 0 or 1. For
example, if the control signal $w_1$ in the circuit shown
in Figure 2 is stuck-at-zero, bit $m_1$ of the memory will
never be initialized and will remain at X. On the other
hand, if the precharge clock dptr is stuck-at-one, any time
we try to read a 1 value out of a memory cell, a short
circuit will develop between Vdd and Gnd, giving an X on
the bus. Whether or not such X's would be detected in an
actual test depends on detailed characteristics of the
circuit that cannot be predicted at the switch level, such
as the initial voltages of dynamic nodes, how the voltage
would divide across a shorting path, and the thresholds of
the devices sensing these X values. On one hand, a
pessimist might argue that an X in a faulty circuit should
be considered undetectable, because there is no guarantee
that the X will produce an effect different than the state
of the node in the good circuit. On the other hand, a fault
that prevents the circuit from being initialized, such as a
stuck-at-zero clock line, would clearly be quickly
detected. As a compromise, FMOSSIM allows the user to
specify a soft detect limit $l$ such that if in the good
circuit some output changes both to 1 and to 0 at least $l$
times each, while the output in a faulty circuit remains at
X, then this fault is considered detected. This approach
seems to work reasonably well in practice.
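
One way to read the soft detect rule is sketched below. It
assumes the output is sampled once per pattern, that the
first sample does not count as a change, and that the
function name and trace format are purely illustrative;
FMOSSIM's actual bookkeeping is not described in the text.

```python
def soft_detected(good_trace, faulty_trace, limit):
    """Return True if a fault that leaves an output at X counts as detected.

    good_trace and faulty_trace list one output's state after each pattern.
    """
    if any(state != "X" for state in faulty_trace):
        return False                      # the faulty output did not stay at X
    to_one = to_zero = 0
    for prev, cur in zip(good_trace, good_trace[1:]):
        if cur != prev and cur == "1":
            to_one += 1                   # good output changed to 1
        elif cur != prev and cur == "0":
            to_zero += 1                  # good output changed to 0
    return to_one >= limit and to_zero >= limit


good = list("0101010")                    # three changes in each direction
stuck_at_x = ["X"] * len(good)
detected = soft_detected(good, stuck_at_x, limit=3)
```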


BEHAVIORAL  MODEL 

The operation of a MOS circuit is characterized in the
switch-level model in terms of its steady state response
function [9], which can best be explained in terms of an
analogy to electrical networks. A MOS transistor behaves as
a voltage-controlled, nonlinear resistor where the voltages
of its gate, source and drain nodes control the resistance
between its source and drain. Suppose in a transistor
circuit we could control the transistor resistances
independently of the node voltages. For a given setting of
the transistor resistances, such a circuit acts as a
network of passive elements which, for a given set of
initial node voltages, has a unique set of steady state
node voltages. Thus a function that maps transistor
resistances and initial node voltages to steady state node
voltages gives a partial characterization of the behavior
of a transistor circuit. The steady state response function
provides just this sort of characterization, but in terms
of node and transistor states 0, 1, and X. That is, for a
given set of initial node and transistor states, the steady
state response function yields the set of states which the
storage nodes would eventually reach if all transistors
were held fixed in their initial states. This function only
approximates network behavior, since it does not describe
the rate at which nodes approach their steady states nor
the effects of the changing transistor states as their gate
nodes change state.

In general, a switch-level network may contain nodes and
transistors in the X state. Such states arise from improper
charge sharing or (transient) short circuits even in
properly designed networks. The behavior of a network in
the presence of X states must be described in a way that is
neither overly optimistic (i.e. ignoring possible error
conditions), nor overly pessimistic (i.e. spreading X's
beyond the region of indeterminate behavior). This can be
accomplished by defining the steady state response of a
node to be 0 or 1 if and only if the node would have this
unique state regardless of whether each node and transistor
in the X state had state 0 or 1; otherwise, the steady
state of the node is defined to be X. Rather than computing
the steady state for all possible combinations of the nodes
and transistors in the X state set to 0 or 1 (a task of
exponential complexity), an equivalent two-pass linear time
algorithm is used. Each pass involves solving a set of
equations expressed in a simple, discrete algebra using a
relaxation algorithm.
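
The definition of the X state can be made concrete with the
exponential enumeration that the two-pass algorithm avoids.
The sketch below is not the algorithm used in FMOSSIM; it
only spells out the definition, and `pass_gate` is a
hypothetical binary response function chosen for
illustration.

```python
from itertools import product


def ternary_response(binary_fn, states):
    """Evaluate binary_fn over every 0/1 completion of the X entries.

    The result is 0 or 1 only if all completions agree; otherwise it is X.
    """
    x_positions = [i for i, s in enumerate(states) if s == "X"]
    results = set()
    for choice in product("01", repeat=len(x_positions)):
        trial = list(states)
        for pos, val in zip(x_positions, choice):
            trial[pos] = val
        results.add(binary_fn(trial))
    return results.pop() if len(results) == 1 else "X"


def pass_gate(states):
    """Hypothetical binary example: (gate, input, stored) -> new stored state."""
    gate, inp, stored = states
    return inp if gate == "1" else stored
```

With an X on the gate, the stored value stays determinate
when the input and the stored charge agree, and becomes X
when they differ, which is neither optimistic nor needlessly
pessimistic.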

Given a technique for computing the steady state response
function, a switch-level logic simulator can be implemented
that simulates the operation of the network by repeatedly
performing unit steps until a stable state is reached. Each
unit step involves computing the steady state response of
the network, setting the storage nodes to these values, and
setting the transistors according to the states of their
gate nodes. This simulation technique implements a timing
model in which transistors switch one time unit (i.e. one
evaluation of the steady state response function) after
their gate nodes change state. Such a timing model tells
little about the speed of a circuit but usually suffices to
describe the circuit's logical behavior. As with other unit
delay simulations, this computation may not reach a stable
condition due to oscillations in the circuit, and hence an
upper bound must be placed on the number of steps
simulated.
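
The repeat-until-stable loop with a step bound might look
like the following sketch. The `step` argument stands in for
one unit step, and the two toy step functions are
hypothetical unit delay models, not switch-level networks.

```python
def settle(step, state, max_steps=100):
    """Apply unit steps until the state stops changing or the bound is hit.

    Returns (final state, steps taken, True if a stable state was reached).
    """
    for count in range(max_steps):
        new_state = step(state)
        if new_state == state:
            return state, count, True
        state = new_state
    return state, max_steps, False


def inverter_chain(state):
    # b follows NOT a and c follows NOT b, each one time unit late
    return {"a": state["a"], "b": 1 - state["a"], "c": 1 - state["b"]}


def ring_oscillator(state):
    # a single inverter fed back on itself never settles
    return {"a": 1 - state["a"]}


final, steps, stable = settle(inverter_chain, {"a": 1, "b": 1, "c": 1})
_, _, ring_stable = settle(ring_oscillator, {"a": 0}, max_steps=10)
```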

On a given unit step, often only a small portion of the
network changes state, while the rest of the network
remains inactive. Most logic simulators exploit this
property by recomputing the output of a logic gate only if
at least one of the gate's inputs has changed state. A
similar effect is achieved in switch-level networks by
viewing network activity as creating small perturbations of
the network state, and only computing the effects of these
perturbations incrementally rather than recomputing the
state of the entire network. We say that a storage node is
perturbed if it is the source or drain of a transistor that
has changed state, or if it is connected by a transistor in
the 1 or X state to an input node that has changed state.
Such a perturbation can only affect storage nodes in the
vicinity of the perturbed node, where two nodes are in the
same vicinity if and only if there exists some path of
transistors in the 1 or X state between the nodes which
does not pass through any input nodes. This definition
exploits the dynamic locality in the network where the
source and drain of a transistor in the 0 state are
considered to be electrically isolated. Typically, a
vicinity contains only a few nodes, and hence activity
remains highly localized.

Figure 4. Implementation of unit step

Figure 4 shows a simplified implementation of the unit step
operation that uses this incremental perturbation technique
to recompute only selected parts of the network state. The
argument P is a set of perturbed storage nodes derived from
either new data and clock inputs to the circuit or from the
last unit step. For each of these nodes, update-vicinity
finds all the nodes in the same vicinity, computes their
steady state response, and returns a set of nodes that
changed state. These updated nodes are accumulated in the
set U. Vicinities are found by a depth first search [10]
originating at the perturbed node and tracing outward
through transistors in the 1 or X state from source to
drain until an input node is encountered. As each node is
added to the vicinity, it is flagged to avoid duplication
and endless cycles. For each updated node n in U,
perturb-transistors finds all transistors whose gate node
is n and checks to see if they have changed state. Nodes
perturbed by these changing transistor states are
accumulated in a new set P in preparation for the next unit
step. Finally, update-node sets each updated node to its
new state.
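
The code of Figure 4 did not survive in the text, so the
following sketch reconstructs the unit step from the prose
alone. The network layout (a dictionary with `inputs`,
`state`, `transistors` and `tstate`), the transistor state
convention, and especially `naive_solve`, which ignores
strengths, sizes and X transistors, are all assumptions made
for illustration.

```python
def transistor_state(kind, gate):
    """Assumed convention: n follows the gate, p inverts it, d is always on."""
    if kind == "d":
        return "1"
    if gate == "X":
        return "X"
    return gate if kind == "n" else {"0": "1", "1": "0"}[gate]


def find_vicinity(net, start):
    """Depth first search through transistors in the 1 or X state."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)                      # flag: no duplicates, no endless cycles
        if node in net["inputs"]:
            continue                        # do not trace through input nodes
        for (_, src, drn, _), ts in zip(net["transistors"], net["tstate"]):
            if ts != "0" and node in (src, drn):
                stack.append(drn if node == src else src)
    return seen


def naive_solve(net, vicinity):
    """Placeholder steady state: storage nodes copy the inputs they reach."""
    driven = {net["state"][n] for n in vicinity if n in net["inputs"]}
    if not driven:
        return {}                           # isolated charge keeps its state
    value = driven.pop() if len(driven) == 1 else "X"
    return {n: value for n in vicinity if n not in net["inputs"]}


def unit_step(net, perturbed, solve=naive_solve):
    updates, visited = {}, set()
    for node in perturbed:                  # update-vicinity
        if node in visited:
            continue
        vicinity = find_vicinity(net, node)
        visited |= vicinity
        for n, new in solve(net, vicinity).items():
            if new != net["state"][n]:
                updates[n] = new
    next_perturbed = set()
    for i, (gate, src, drn, kind) in enumerate(net["transistors"]):
        if gate in updates:                 # perturb-transistors
            new_ts = transistor_state(kind, updates[gate])
            if new_ts != net["tstate"][i]:
                net["tstate"][i] = new_ts
                next_perturbed |= {src, drn} - net["inputs"]
    net["state"].update(updates)            # update-node
    return next_perturbed


# Two pass transistors in sequence; 'en' has just risen, closing the first.
net = {
    "inputs": {"vdd", "din", "en"},
    "state": {"vdd": "1", "din": "1", "en": "1", "a": "0", "b": "0"},
    "transistors": [("en", "din", "a", "n"), ("a", "vdd", "b", "n")],
    "tstate": ["1", "0"],
}
first = unit_step(net, {"a"})               # a charges to 1, which perturbs b
second = unit_step(net, first)              # b charges to 1, nothing remains
```

Each call handles one time unit: node `a` changes in the
first step, the transistor it gates switches as a result,
and node `b` follows in the second step.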

CONCURRENT  SIMULATION 

We have seen that the presence or absence of a fault in a
switch-level network is controlled by the state of a fault
input node. Suppose the test patterns that specify data and
clock input values are extended to include values for the
network's fault input nodes. Then the behavior of a set of
faulty circuits can be determined by repeatedly simulating
patterns that differ only in selected fault input values.
Hence, concurrent fault simulation can be viewed as the
problem of efficiently applying a large number of nearly
identical test sequences to a single network. This
viewpoint separates issues of fault modeling from
concurrent simulation. For example, since values for fault
input nodes are specified on an individual pattern by
pattern basis, multiple and intermittent faults are easily
modeled without changing the basic simulation algorithm.
Furthermore, there are no inherent restrictions requiring
that the data inputs of all test sequences be identical.
Thus, concurrent simulation is useful not only for
simulating faults, but for simulating sets of similar test
patterns on a fault-free circuit.

The concurrent simulation algorithm is given a description
of the network and a set of test sequences
$T = \{t_0, \ldots, t_n\}$. A test sequence $t_i \in T$
consists of a sequence of test patterns, each specifying
values for the data, clock and fault inputs of the network.
The algorithm simulates the network to determine how each
node behaves for each test sequence $t_i$. That is, at any
point during the simulation, each node's state $s_i$ in
test sequence $t_i$ is found. Since we assume that the
behavior of the network differs only slightly from test
sequence to test sequence, $s_i = s_0$ for most nodes in
the network. This observation is exploited by representing
node states compactly as a set of pairs
$S = \{(t_i, s_i)\}$, called a state set, where
$(t_i, s_i) \in S$ if and only if $i = 0$ or
$s_i \neq s_0$. The behavior of the network for test
sequence $t_0$ serves as a reference point, since states
are explicitly stored only for test sequence $t_0$ and
those sequences $t_i$ whose states differ from the state
for $t_0$. For this reason, test sequence $t_0$ is called
the reference sequence. For fault simulation, the reference
sequence corresponds to the good circuit, while test
sequences $t_i$, $i \neq 0$, differ only in selected fault
input values, and hence correspond to faulty circuits. A
node is said to be diverged for $t_i$ if $s_i \neq s_0$. A
node is said to be diverged if it is diverged for any
$t_i$. If the gate node of a transistor is diverged, then
the transistor itself is said to be diverged.
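
The state set representation can be sketched with a
dictionary keyed by test sequence index. This is an
illustrative choice, not necessarily the structure used in
FMOSSIM, and the helper names are hypothetical.

```python
def make_state_set(states):
    """Build a state set from states[i], the node state for sequence t_i.

    Only the reference entry (i = 0) and entries that differ from it are kept.
    """
    ref = states[0]
    state_set = {0: ref}
    for i, s in enumerate(states):
        if i != 0 and s != ref:
            state_set[i] = s
    return state_set


def lookup(state_set, i):
    """State for t_i, falling back to the reference sequence t_0."""
    return state_set.get(i, state_set[0])


def is_diverged(state_set, i=None):
    """Diverged for t_i, or for any sequence when i is omitted."""
    if i is None:
        return len(state_set) > 1
    return i != 0 and i in state_set


# A node that is 1 in the good circuit and for t_2, but 0 for t_1.
bus = make_state_set(["1", "0", "1"])
```

A node that behaves identically in every circuit costs a
single entry, however many faulty circuits are simulated.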

If a node is perturbed due to an input node or transistor
changing state for the reference sequence, it is likely
that the node is also perturbed for most other test
sequences $t_i$. We exploit this observation by maintaining
a set of perturbations of the form $P = \{(n_j, t_i)\}$,
called the perturbation set, where $(n_j, t_0) \in P$ if
and only if node $n_j$ is perturbed for the reference
sequence $t_0$, and $(n_j, t_i) \in P$, $i \neq 0$, if and
only if $n_j$ is perturbed for $t_i$ but not for the
reference sequence. The perturbation $(n_j, t_i) \in P$,
where $i \neq 0$, indicates that the network has behaved
differently in the area near node $n_j$ for test sequence
$t_i$ when compared to its behavior for the reference
sequence.

As described above, each unit step of the conventional
switch-level simulation algorithm computes a steady state
response for each node in the vicinity of a perturbed node,
updates those nodes that have new steady states, and
returns a set of perturbations for the next unit step. To
generalize this operation for concurrent simulation,
observe that the perturbation $(n_j, t_0) \in P$ represents
a perturbation not only for the reference sequence, but
likely for most other test sequences. In general, the
steady state response of nodes in a vicinity is a function
of both their initial states and the states of the
transistors whose source or drain node is in the vicinity.
Thus, when the steady state response is computed for the
nodes in some vicinity as a result of a perturbation for
the reference sequence, we must check to see if any of the
nodes or transistors are diverged. We expect that most of
the time, for most test sequences $t_i$, nodes within the
vicinity will not be diverged for $t_i$. In this case, the
steady state response computation performed for the
reference sequence will be valid for $t_i$, and hence there
is no need to duplicate this computation for $t_i$.
However, if some node $n_j$ within the vicinity is diverged
for $t_i$, then the steady state response computation using
the states of the nodes and transistors for the reference
sequence may not be valid for $t_i$. To guarantee that an
accurate computation is performed for $t_i$, the
perturbation $(n_j, t_i)$ is added to $P$. In effect, we
are simply scheduling a steady state response computation
that will be performed sometime later. Diverged transistors
are handled in a similar manner, for if some transistor
with source $n_s$ or drain $n_d$ in the vicinity is found
to be diverged for $t_i$, then the perturbations
$(n_s, t_i)$ and $(n_d, t_i)$ are added to $P$.
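
The scheduling rule for diverged nodes and transistors can
be sketched as follows. State sets are taken to be
dictionaries keyed by test sequence index, transistors are
(gate, source, drain) triples, and input nodes are filtered
out of the result; all three are assumptions made for this
illustration, as is the function name.

```python
def schedule_diverged(vicinity, inputs, node_sets, transistors):
    """Collect (node, i) perturbations after a reference vicinity is simulated."""
    pending = set()
    for node in vicinity - inputs:
        # a diverged storage node needs its own computation for each such t_i
        pending |= {(node, i) for i in node_sets[node] if i != 0}
    for gate, src, drn in transistors:
        if src in vicinity or drn in vicinity:
            # a transistor is diverged wherever its gate node is diverged
            for i in node_sets[gate]:
                if i != 0:
                    pending |= {(n, i) for n in (src, drn) if n not in inputs}
    return pending


# A vicinity in which one node is diverged for t_1 and one fault
# transistor (gate f2) is diverged for t_2.
node_sets = {
    "bus": {0: "1", 1: "0"},
    "data": {0: "1"},
    "m2": {0: "1"},
    "Vdd": {0: "1"},
    "Gnd": {0: "0"},
    "f2": {0: "0", 2: "1"},
}
pending = schedule_diverged(
    vicinity={"bus", "data", "m2", "Vdd", "Gnd"},
    inputs={"Vdd", "Gnd", "f2"},
    node_sets=node_sets,
    transistors=[("f2", "m2", "Vdd")],
)
```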

To determine the steady state response for nodes in the
vicinity of a perturbation $(n_j, t_i)$, where $i \neq 0$,
states of the nodes and transistors for test sequence $t_i$
must be found. This involves searching node state sets $S$
for elements of the form $(t_i, s_i)$. If such an element
is not found, then the state for the reference sequence is
used. To reduce search time, elements in both the state
sets $S$ and perturbation set $P$ are kept sorted by test
sequence.

$V_{dd} = m_1 = m_2 = data = \{(t_0, 1)\}$
$Gnd = dpt = din = data = \{(t_0, 0)\}$
$r_2 = w_1 = w_2 = c_1 = c_2 = \{(t_0, 0)\}$
$bus = \{(t_0, 1), (t_1, 0)\}$
$r_1 = f_1 = \{(t_0, 0), (t_1, 1)\}$
$f_2 = \{(t_0, 0), (t_2, 1)\}$

Figure 5. Initial Node States
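
Keeping a state set sorted by test sequence allows a fast
lookup with a fallback to the reference state. The sketch
below uses binary search from the Python standard library;
the text only says the elements are kept sorted, so the
search method and the function name are assumptions.

```python
from bisect import bisect_left


def find_state(state_set, i):
    """Look up the state for t_i in a list of (index, state) pairs.

    state_set is sorted by test sequence index and always starts with the
    reference entry (0, s_0). A missing entry means the node is not
    diverged for t_i, so the reference state is returned.
    """
    pos = bisect_left(state_set, (i,))     # (i,) sorts just before (i, state)
    if pos < len(state_set) and state_set[pos][0] == i:
        return state_set[pos][1]
    return state_set[0][1]


bus = [(0, "1"), (1, "0")]                 # 1 in the good circuit, 0 for t_1
```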

As an example of this simulation technique, consider the
circuit shown in Figure 2. An operation that sets node
$m_2$ to 0 will be described. Suppose initially that nodes
Vdd, $m_1$, $m_2$, bus, and data have state 1 and all other
nodes have state 0. Two fault transistors are added to the
network, one connecting node $m_2$ to Vdd whose gate is
fault input $f_2$, the other connecting node $r_1$ to Vdd
whose gate is $f_1$. For the reference sequence, both of
these fault input nodes have state 0 so that the faults are
absent. For test sequence $t_1$, $f_1$ has state 1 to
inject fault $r_1$ stuck-at-one. For test sequence $t_2$,
$f_2$ has state 1 to inject fault $m_2$ stuck-at-one. Due
to the fault injected by $t_1$, bus and Gnd are connected
by conducting transistors, hence bus is initially 0 for
$t_1$. The representation of these initial states is shown
in Figure 5.

To set $m_2$ to 0, nodes din and $w_2$ must be set to 1.
These changes perturb bus, data, and $m_2$, since they are
connected to the source or drain of transistors that have
changed state. The vicinity for each of these perturbed
nodes contains bus, data, $m_2$, Vdd, and Gnd, and so their
steady state responses are determined. All three storage
nodes have steady states 0 due to the connection to Gnd
through transistors whose gates are din and data. The
states of Vdd and Gnd are unchanged since they are input
nodes. Notice that the pull-up connection between data and
Vdd has no effect on the steady state of data since the
strength of this connection, which is $\gamma_1$, is less
than that of the strength $\gamma_2$ pull-down connection
between data and Gnd.

The steady state computation just described was performed
relative to the reference sequence, since node states for
the reference sequence were used to determine which nodes
were within the vicinity as well as their steady state
responses. This computation may be invalid for sequences
$t_1$ or $t_2$ since bus has state 0 for $t_1$ and $m_2$ is
connected to a conducting fault transistor for $t_2$. So
that the appropriate steady state response computations
will be performed for both $t_1$ and $t_2$, the
perturbations (bus, $t_1$) and ($m_2$, $t_2$) are generated
as the vicinity is found.

$V_{dd} = m_1 = din = w_2 = data = \{(t_0, 1)\}$
$Gnd = dpt = data = \{(t_0, 0)\}$
$r_2 = w_1 = c_1 = c_2 = \{(t_0, 0)\}$
$bus = \{(t_0, 0), (t_2, X)\}$
$r_1 = f_1 = \{(t_0, 0), (t_1, 1)\}$
$m_2 = f_2 = \{(t_0, 0), (t_2, 1)\}$

Figure 6. Final Node States

Consider the effects of perturbation ($m_2$, $t_2$). A
vicinity containing bus, data, $m_2$, Vdd, and Gnd is
found, as in the simulation for the reference sequence. The
steady state response of bus depends on the strengths of
the transistors whose gates are din and $w_2$. If both of
these transistors have strength $\gamma_1$, data stays 0
but bus becomes X due to the short between Gnd and Vdd
through the fault transistor connected to $m_2$.

Now consider the effects of the perturbation (bus, $t_1$).
In this case, the vicinity contains node $c_1$, in addition
to those found above. The short between bus and Gnd has no
effect, and $c_1$, $m_2$, bus, and data all have steady
state responses equal to those in the good circuit. The
representation of the final node states is shown in Figure
6.

In this example, we have seen that faults may affect the
steady state response of nodes as well as which nodes are
contained within a vicinity. By explicitly generating
perturbations for diverged nodes and transistors when a
vicinity in the good circuit is simulated, we exploit the
locality of activity in each faulty circuit independent of
activity in other circuits. Furthermore, this technique
selectively simulates only differing portions of a faulty
circuit, and hence simulation proceeds quickly.

PERFORMANCE  RESULTS 

As a test case for evaluating the performance of FMOSSIM,
we simulated a 64 bit dynamic RAM circuit containing 374
transistors. This circuit incorporates a variety of MOS
structures such as logic gates, bidirectional pass
transistors, dynamic latches, precharged busses, and
three-transistor dynamic memory elements. The circuit was
simulated with 428 faults: each storage node stuck-at-zero,
each storage node stuck-at-one, and pairs of adjacent
busses shorted together. To validate the program, we also
simulated other faults, including stuck-open and
stuck-closed transistors. The simulator was implemented in
the Mainsail programming language [11], and run on a
DEC-20/60.

Figure 7. Performance on Memory Circuit

Figure 7 illustrates the performance of FMOSSIM when
simulating a test sequence consisting of a marching test of
the memory, together with special tests for the control
logic. The curve climbing diagonally upward indicates the
total number of faults detected as the test progresses. All
faults were detected after 407 patterns. The falling curve
indicates the CPU time required to simulate each pattern.
This time starts at 27 seconds when the circuits are
initialized. After 100 patterns, it drops to around 1
second as faults are detected and the simulations of these
circuits are dropped. This time finally reaches 0.3 seconds
at the end of the simulation, when only the good circuit is
being simulated.


Figure  8.  Effective  Concurrency 

Figure 8 illustrates the performance advantage of
concurrent simulation over simulating each faulty circuit
separately. The curve falling diagonally to the right
indicates the number of circuits being simulated as the
test proceeds. The other curve indicates the CPU time
required to simulate each pattern divided by the number of
circuits being simulated for that pattern. This curve
starts at about 0.06 seconds per pattern, drops to a low of
0.006 seconds once those faults causing major differences
from the good circuit are dropped, and finally climbs back
to 0.3 seconds when only the good circuit is being
simulated. Considering that simulating a single circuit
requires about 0.3 seconds per pattern, the effective
benefit of simulating all of the circuits concurrently
starts at 5 times serial simulation, rises to 50 times, and
drops back down to 1.


Over the entire test sequence, simulating the good machine
alone requires 2.5 CPU minutes. Our fault simulation
requires 11 CPU minutes, whereas simulating each faulty
circuit serially until it produces a different result than
the good circuit would take almost 6 hours. Thus, in this
case, concurrent simulation has a thirty-fold net advantage
over serial simulation. Such a performance gain is clearly
worth the effort.
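
The ratios quoted in this section can be checked from the
reported times. The sketch below only redoes that
arithmetic; the count of 429 circuits (428 faulty circuits
plus the good circuit) and the use of a full 6 hours for
the serial run are assumptions drawn from the surrounding
text.

```python
def ratio(serial, concurrent):
    """How many times faster the concurrent figure is than the serial one."""
    return serial / concurrent


single_circuit = 0.3                      # seconds per pattern, one circuit

start_gain = ratio(single_circuit, 0.06)  # per-circuit cost at the start
peak_gain = ratio(single_circuit, 0.006)  # per-circuit cost at its lowest
final_gain = ratio(single_circuit, 0.3)   # only the good circuit remains

# 27 seconds for the first pattern spread over 429 circuits
initial_per_circuit = 27 / 429

# almost 6 hours of serial simulation against 11 CPU minutes
net_gain = ratio(6 * 60, 11)

print(round(start_gain), round(peak_gain), round(final_gain))
print(round(initial_per_circuit, 2), round(net_gain))
```

The per-pattern gains come out at roughly 5, 50 and 1, and
the net figure at a little over 30, in line with the
thirty-fold advantage stated above.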


Our experience with FMOSSIM has shown that it is a very
useful tool for developing test sequences. Even when
developing a test for a small section of an integrated
circuit (such as an ALU or a register array), the fault
simulator provides information that is hard to obtain by
any other means. It quickly directs the designer to those
areas of the circuit that require further tests. For
example, in developing test sequences for the memory design
described previously, we discovered that a simple marching
test provided high coverage in the memory array itself, but
that testing the control logic and peripheral circuits such
as the input and output latches was more difficult.

It remains to be seen how the performance characteristics
of FMOSSIM will vary as the size of the circuit and the
number of faults to be simulated grow large. Even if it
becomes impractical to run full-chip fault simulations with
large numbers of faults, the program could still produce
useful results by simulating portions of the chip, by
eliminating faults that produce effects identical to other
faults, or by simulating only a subset of the possible
faults selected at random.